Normalization

ER modeling is one database design technique; normalization is another, complementary one.

Purpose of Normalization

  • Begins by examining the relationships (called functional dependencies) between attributes.
  • Uses a series of tests (called normal forms) to identify the optimal grouping of the attributes.
  • Removes repeating groups.
  • Removes unwanted functional dependencies.

Characteristics of a suitable set of relations include

  • attributes with a close logical relationship are found in the same relation;
  • minimal redundancy: each attribute is represented only once (except for the foreign keys).

Benefits of having a normalized database with well-designed relations

  • easier for users to access and maintain the data
  • less data duplication -> the data takes up less storage space -> minimizing costs
  • reducing the opportunities for data inconsistencies

How Normalization Supports Database Design

Data Redundancy

Major aim of relational database design is to group attributes into relations to minimize data redundancy.

Update Anomalies

Relations that contain redundant information may potentially suffer from update anomalies.

Insertion anomaly

  • Cannot insert a record with a null primary key.
  • May have to enter duplicate data.
  • Causes problems with missing data.
  • Cannot insert a non-key value without also supplying a primary key value.

Modification anomaly

  • Changing one attribute value may require changing the related information everywhere that attribute appears.
  • Multiple rows of the table must be changed when one value is updated.
  • An update of one attribute may have to be carried out on several tuples if the information is duplicated.

Deletion anomaly

  • Deleting one element of data may cause other data (including primary key values) to be lost.
  • Deleting the row that represents the last member of a specific group also removes all information about that group from the database.
  • For example, deleting the row for the last member of staff at a branch also removes all information about that branch from the database.

Example

Anomalies occur when a table contains facts about two or more different themes.
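
A hedged illustration (the table and sample rows are made up to match the staff/branch example used later, not taken from the original figure): a single StaffBranch table mixing staff facts with branch facts suffers from all three anomalies, because branch details must be repeated for every member of staff.

CREATE TABLE StaffBranch (
  staffNo    VARCHAR(10) PRIMARY KEY,   -- staff theme
  sName      VARCHAR(50),
  branchNo   VARCHAR(10),               -- branch theme, duplicated for every staff member
  branchAddr VARCHAR(100)
);

-- the branch address is repeated in every row for that branch (modification anomaly),
-- a branch cannot be recorded without a staff member (insertion anomaly),
-- and deleting the last staff member of a branch deletes the branch too (deletion anomaly)
INSERT INTO StaffBranch VALUES ('SL21', 'John White', 'B005', '22 Deer Rd, London');
INSERT INTO StaffBranch VALUES ('SG37', 'Ann Beech',  'B005', '22 Deer Rd, London');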

That's why we need normalization.

  • Every normalized relation has a single theme.
  • Normalization breaks up relations so that each one has a single theme.
  • The resulting relations show a high degree of data independence.

Functional Dependencies (FD)

Functional dependency (FD) is an important concept associated with normalization.

It is used to identify the primary key for a relation.

Definition of Functional dependency (FD)

  • Assume that A and B are attributes of a relation.
  • If each value of A is associated with exactly one value of B,
  • then B is functionally dependent on A (denoted A → B),
  • and A is the determinant of B.
  • The relationship must be one-to-one from each value of A to a value of B.
  • A one-to-many relationship is acceptable provided the "one" side is the primary key.

Partial Dependencies (PD)

  • If B is functionally dependent on a proper subset of a composite determinant A, the dependency is called a partial dependency.

    E.g. if {A,B} → {C} but also {A} → {C}, then {C} is partially functionally dependent on {A,B}.

    e.g. {staffNo, sName} → {branchNo}

    but also {staffNo} → {branchNo}

    Therefore {branchNo} is partially functionally dependent on {staffNo, sName}.

Characteristics of Functional Dependencies

  • There is a one-to-one relationship between the attribute(s) on the left-hand side (determinant) and those on the right-hand side of a functional dependency.
  • Holds for all time.
  • The determinant has the minimal number of attributes necessary to maintain the dependency with the attribute(s) on the right-hand side.

Transitive Dependencies

  • describes a condition where A, B, and C are attributes of a relation
    • such that if A → B and B → C, then C is transitively dependent on A via B
    • A→B, B→C, then A→C
      • e.g. orderNo and supplierNo and supplier-name
      • orderNo -> supplierNo, supplierNo-> supplier-name, Then orderNo -> supplier-name

It is important to recognize a transitive dependency, because its existence in a relation can potentially cause update anomalies.

The Process of Normalization

Normalization is a formal technique for analyzing a relation based on its primary key and the functional dependencies between the attributes of that relation. Often executed as a series of steps. Each step corresponds to a specific normal form, which has known properties.

As normalization proceeds, the relations become progressively more restricted (stronger) in format and also less vulnerable to update anomalies.

  • resulting files show a high degree of data independence.

Unnormalized Form (UNF)

Basically: There are multivalued attributes or repeating groups

A table that contains one or more repeating groups.

Example:

Normalization : 1NF to 3NF

Normalization Tool

First Normal Form (1NF)

non-keys depend on key

Removed all repeating groups

  • Remove repeating groups by entering appropriate data into the empty rows.

Basically: No multivalued attributes or repeating groups

each intersection of a row and column contains one and only one value.

Example:
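
A minimal sketch (hypothetical table, columns, and rows, since the original example figure is not shown): a customer with several phone numbers is brought into 1NF by entering the repeating phone numbers as extra rows, so that each row-and-column intersection holds exactly one value.

-- 1NF: one value per row and column; customer data is repeated for each phone number
CREATE TABLE CustomerPhone (
  custNo   INT,
  custName VARCHAR(50),
  phone    VARCHAR(20),
  PRIMARY KEY (custNo, phone)
);

INSERT INTO CustomerPhone VALUES (1, 'Mary Hill', '111-1111');
INSERT INTO CustomerPhone VALUES (1, 'Mary Hill', '222-2222');  -- repeating group entered as an extra row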

Second Normal Form (2NF)

non-keys depend on the whole key

  • Removed partial dependencies

Basically: 1NF + No partial dependencies

1NF + each non-key attribute must be fully dependent on the whole primary key

Achieved by splitting the table

To convert 1NF to 2NF:

  1. Identify Primary key for the 1NF relation
  2. Identify the functional dependencies in the relation
  3. If partial dependencies exist on the primary key, remove them by placing them in a new relation along with a copy of their determinant. (Split the table.)

Example:

1NF form: {SalespersonNo, ProdNo} -> {everything}

Remove the partial dependencies from the 1NF table and form them into new tables according to their determinants.

Note that you need to keep a copy of the primary-key attributes in the original table, then rename the original table.

Now you have 3 tables, and the design is in 2NF with no partial dependencies (a hedged SQL sketch follows).
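
A hedged SQL sketch of the split, assuming hypothetical non-key attributes (SalespersonName, DeptNo, ManagerName, ProdName, Quantity) for the 1NF relation keyed on {SalespersonNo, ProdNo}:

-- attributes that depended only on SalespersonNo (a partial dependency)
CREATE TABLE Salesperson (
  SalespersonNo   INT PRIMARY KEY,
  SalespersonName VARCHAR(50),
  DeptNo          INT,
  ManagerName     VARCHAR(50)
);

-- attributes that depended only on ProdNo (a partial dependency)
CREATE TABLE Product (
  ProdNo   INT PRIMARY KEY,
  ProdName VARCHAR(50)
);

-- the renamed original table keeps the composite key plus the fully dependent attributes
CREATE TABLE Sale (
  SalespersonNo INT,
  ProdNo        INT,
  Quantity      INT,
  PRIMARY KEY (SalespersonNo, ProdNo),
  FOREIGN KEY (SalespersonNo) REFERENCES Salesperson(SalespersonNo),
  FOREIGN KEY (ProdNo) REFERENCES Product(ProdNo)
);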

Third Normal Form (3NF)

non-keys depend on the whole key, nothing but the key

  • Removed transitive dependencies

Basically: 2NF + no transitive dependencies (split the table)

1NF + 2NF + none of the non-primary-key attributes is transitively dependent on the primary key

  • No (non-key→non-key) via transitive dependence

To convert 2NF to 3NF:

  1. Identify Primary key for the 2NF relation
  2. Identify the functional dependencies in the relation
  3. If transitive dependencies exist on the primary key, remove them by placing them in a new relation along with a copy of their determinants. (Split the table.)

Example:

Take the previous 2NF example.

We can see there is a non-key → non-key dependency in the Salesperson table:

{SalespersonNo} → {DeptNo}, {DeptNo} → {ManagerName}

Therefore {SalespersonNo} → {ManagerName}, i.e. ManagerName is transitively dependent on SalespersonNo via DeptNo.

This violates 3NF.

Move the transitively dependent attribute into a new table, keeping a copy of its determinant (DeptNo) as a foreign key in the original table.

Now you have 4 tables, and the design is in 3NF with no transitive dependencies (a hedged SQL sketch follows).
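
Continuing the same hypothetical sketch, the transitive dependency DeptNo → ManagerName is moved into a new Department table, and DeptNo stays in Salesperson as a foreign key:

-- new table for the transitive dependency DeptNo -> ManagerName
CREATE TABLE Department (
  DeptNo      INT PRIMARY KEY,
  ManagerName VARCHAR(50)
);

-- copy the existing department facts, then drop the transitively dependent column
INSERT INTO Department (DeptNo, ManagerName)
  SELECT DISTINCT DeptNo, ManagerName FROM Salesperson;

ALTER TABLE Salesperson
  DROP COLUMN ManagerName,
  ADD FOREIGN KEY (DeptNo) REFERENCES Department(DeptNo);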

Data Warehousing

To set up a data warehouse, you take all the data, make it consistent, and put it in a single location in a suitable format.

NOT IN 3NF (Big Data)

De-normalising Data

  • The data should be de-normalised to 2NF.
  • Data redundancy occurs in 2NF.
  • More storage is needed.

Why?

  • You can get the data more quickly
  • The purpose of a DataWarehouse is to provide aggregate data which is in a suitable format for decision making.

Business Intelligence (BI)

BI refers to the processes for

  • collecting and analysing data
  • with the purpose of facilitating corporate decision making.

In order to make the business better.

Data warehousing is one of the key technologies for implementing BI.

How?

BI helps enterprise users make better business decisions by:

  • Understanding the health of the organization
    • The whats and hows
  • Collaborating on the same view of data and business logic
  • Reducing the time to make decisions (automation)

Challenges in Delivery

  • Large volume of related data
  • Variety of data formats
  • Location of data
  • Data are not easily queried
  • Data are not optimized for analytical queries
  • Data are not accessible
  • Data definitions are not consistent
  • Users may not have the skills, tools, or permission to query the data

Ideally, BI is

  • Empowerment – directly usable
  • Fast - responsive
  • Timely - available
  • Accurate – trusted with quality
  • Usable – has value

Relationship between BI and DW

OLTP VS OLAP

| Online transaction processing (OLTP) | Online analytical processing (OLAP) |
| --- | --- |
| Designed to maximize transaction throughput | Designed to hold historical data for analyses |
| Data stored in normalized format | Data stored in denormalized format |
| DB size is small (around 100 MB to 100 GB) | DB size is huge (around 100 GB to a few TB) |
| Used by normal users | Used by decision makers |
| Slow query response if the amount of data is very large | Quicker query response, generates reports faster |

Data Warehousing Concepts

A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993).

The Goal of Data Warehousing

To integrate corporate data into a single repository from which users can easily run queries, produce reports, and perform analysis (for making business decisions).

Subject-Oriented

  • warehouse is organized around the major subjects of the enterprise
  • not organized around the business operations
  • This is reflected in the need to store decision-support data rather than application-oriented data.

Integrated Data

  • integrates corporate application-oriented data from different source systems
    • often includes data that is inconsistent (different format)
  • integrated data source must be made consistent
    • to present a unified view of the data to the users

Time Variant

  • data in a data warehouse is identified with a particular time period
  • data in data warehouse provide information from a historical point of view

Non-volatile Data

  • Data in the warehouse is not normally updated in real-time (RT)
  • It is refreshed from operational systems on a regular basis
  • New data is always added as a supplement to the database, rather than a replacement.

However, emerging trend is towards real-time or near real-time DWs

Benefits of Data Warehousing

  • Potential high returns on investment (ROI)
  • Competitive advantage
    • Better business decisions
    • Reveal of untapped information
  • Increased productivity of corporate decision-makers
    • allows corporate decision makers to perform more accurate and consistent analysis

OLTP VS Data Warehousing

  • DBMS built for online transaction processing (OLTP)
    • usually not suitable for data warehousing.

OLTP is designed to maximize transaction throughput.

DW is designed to support ad hoc queries for business decision making.

Therefore:

  • OLTP is for making transactions.
  • DW is for making decisions.

|  | OLTP | Data Warehousing |
| --- | --- | --- |
| Main purpose | Support operational processing | Support analytical processing |
| Data age | Current | Historic |
| Data latency | Real-time | Depends on length of cycle for data supplements to warehouse |
| Data granularity | Detailed data | Detailed data, lightly and highly summarized data |
| Data processing | Predictable pattern | Less predictable pattern |
| Reporting | Predictable, relatively static fixed reporting | Unpredictable, dynamic reporting |
| Users | Serves large number of operational users | Serves lower number of managerial users |

Data Warehouse Architecture

Operational Data Sources

  • Main sources are online transaction processing (OLTP) databases.
  • include sources such as personal databases, spreadsheets, and web usage log files.
  • Holds current and integrated operational data for analysis

ETL Manager

ETL stands for extraction, transformation, and loading.

The ETL manager performs the extraction and loading of data into the warehouse.

  • Data for a DW must be extracted from one or more data sources
  • transformed into a form that is easy to analyze
    • consistent with data already in the warehouse
  • then finally loaded into the DW

Warehouse Manager

The warehouse manager manages data in the warehouse.

  • Analysis of data to ensure consistency
  • Transformation and merging of source data from temporary storage into data warehouse tables
  • Creation of indexes and views on base tables
  • Generation of denormalizations if necessary
  • Generation of aggregations if necessary
  • Backing-up and archiving data

Query Manager

The query manager manages user queries.

  • Directing queries to the appropriate tables and scheduling the execution of queries
  • generates query profiles to allow the warehouse manager
    • to determine which indexes and aggregations are appropriate
    • for speeding up future queries

End-User Access Tools

End-user access tools include:

  • Traditional reporting and query
  • OLAP
  • Data mining

Main purpose of DW is to support decision makers.

This is achieved through the provision of a range of access tools including:

  • reporting and querying,
  • application and development,
  • Online Analytical Processing (OLAP)
  • Data mining

ETL and Data Marts

ETL (Extraction, Transformation and Loading)

  • E - Get the Data
  • T - Make it useful
  • L - Save it to the warehouse

Extraction

  • Targets one or more data sources
    • typically include OLTP databases
    • but can also include personal databases and spreadsheets,
    • Enterprise Resource Planning (ERP) files, and web usage log files.
  • data sources are normally internal
    • but can also include external sources
    • such as the systems used by suppliers and/or customers

Transformation

  • Applies a series of rules or functions to the extracted data
    • determines how the data will be used for analysis
    • can involve transformations such as
      • data summations
      • data encoding
      • data merging
      • data splitting
      • data calculations
      • creation of surrogate keys.

Loading

Data loading involves (a short SQL sketch follows the list):

  • sorting
  • summarizing
  • consolidating
  • checking integrity
  • building indices
  • partitions (for speeding up queries).
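
A minimal MySQL sketch of the loading step, assuming hypothetical staging and warehouse tables: data is summarized and consolidated into a warehouse table, and an index is built to speed up queries.

-- consolidate and summarize staging data into the warehouse table
INSERT INTO dw_sales_summary (year, country, total_profit)
SELECT year, country, SUM(profit)
FROM staging_sales
GROUP BY year, country;

-- build an index to speed up analytical queries
CREATE INDEX idx_sales_year ON dw_sales_summary (year);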

Data Marts (Sub-sets of the DataWarehouse)

  • Don’t mess with my data.
  • Keep it simple for the user.
  • Small problems are easier to solve.

Usage of Data Marts

  • give users access to the data they need to analyze most often
  • provide data in a form that matches the collective view of the data by a group of users in a department or business application area
  • improve end-user response time due to the reduction in the volume of data to be accessed
  • provide appropriately structured data as dictated by the requirements of the end-user access tools
  • Building a data mart is simpler compared with establishing an enterprise-wide DW (EDW)
  • The cost of implementing data marts is normally less than that required to establish an EDW
  • The future users of a data mart are more easily defined and targeted to obtain support for a data mart than an enterprise-wide data warehouse project

Data Marts vs Data Warehouse

|  | Data Warehouse | Data Mart |
| --- | --- | --- |
| Scope | Corporate | Line of Business (LOB) |
| Subject | Multiple | Single subject |
| Data Sources | Many | Few |
| Size (typical) | 100 GB to TB+ | < 100 GB |
| Implementation Time | Months to years | Months |

Data Warehousing Design

Design the Dimensional Model (star schema)

The design is to provide instantaneous query results for analysts.

A dimensional model has (see the SQL sketch after this list):

  • 1 fact table
    • In the middle of the star schema is the fact table.
    • The fact table contains measures.
  • a number of dimension tables
    • Dimension tables contain textual descriptions.
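
A hedged sketch of a star schema in SQL (hypothetical table and column names): a fact table holding measures, surrounded by dimension tables holding textual descriptions.

-- dimension tables: textual descriptions
CREATE TABLE DimDate (
  dateKey INT PRIMARY KEY,
  year    INT,
  month   INT
);

CREATE TABLE DimProduct (
  productKey INT PRIMARY KEY,
  prodName   VARCHAR(50),
  category   VARCHAR(50)
);

-- fact table: measures, with foreign keys pointing at the dimensions
CREATE TABLE FactSales (
  dateKey     INT,
  productKey  INT,
  quantity    INT,
  salesAmount DECIMAL(10,2),
  PRIMARY KEY (dateKey, productKey),
  FOREIGN KEY (dateKey) REFERENCES DimDate(dateKey),
  FOREIGN KEY (productKey) REFERENCES DimProduct(productKey)
);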

OLAP Cube

An OLAP cube stores data in an optimized way to provide a quick response to various types of complex queries by using dimensions and measures.

  • Once the cube is created, we may perform queries on it instead of on the original data sources.

Example:

How to Design an OLAP Cube

  1. Identify the attributes (usually 3) -> determine the dimensions.
  2. Identify how many distinct values each attribute has -> determine the length of each dimension.

MySQL: WITH ROLLUP

The SQL WITH ROLLUP modifier adds extra rows to the output that represent higher-level summary operations; with multiple columns in the GROUP BY clause, multiple summary rows are produced.

Example:

Let's say we have a table sales.

CREATE TABLE sales
(
year INT,
country VARCHAR(20),
product VARCHAR(32),
profit INT
);

With a single column in the GROUP BY clause, the ROLLUP modifier adds one extra row that represents the higher-level summary:

SELECT year, SUM(profit) AS profit
FROM sales
GROUP BY year WITH ROLLUP;

Output:

With multiple columns in the GROUP BY clause, WITH ROLLUP adds multiple summary rows:

SELECT year, country, product, SUM(profit) AS profit
FROM sales
GROUP BY year, country, product WITH ROLLUP;

Output:

Database Security

What Is Database Security

The mechanisms that protect the database against intentional or accidental threats.

Database security encompasses:

  • hardware
  • software
  • people
  • data

Why Care About Database Security

A database represents an essential corporate resource that should be properly secured using appropriate controls.

We consider database security in relation to the following situations:

  • theft and fraud
  • loss of confidentiality (secrecy)
  • loss of privacy
  • loss of integrity
  • loss of availability

Theft and fraud

  • affect not only the database environment but also the entire organization
  • attention should focus on reducing the opportunities for this occurring

Loss of Confidentiality and Privacy

  • privacy refers to the need to protect data about individuals
  • loss of confidentiality could lead to loss of competitiveness
  • loss of privacy could lead to legal action being taken against the organization

Loss of Integrity and Availability

  • Loss of data integrity results in invalid or corrupted data (data corruption)
    • seriously affect the operation of an organization
  • Loss of availability means that the data, or the system, or both cannot be accessed
    • seriously affect an organization’s financial performance.

Threats

An organization should invest time and effort in identifying the most serious threats.

And then initiate appropriate plans and countermeasures.

Example of Threats:

Potential threats to computer systems

Data Security Best Practices

  • Principle of Least Privilege
  • Stay up-to-date on patches
  • Remove/disable unneeded default accounts
  • Firewalling/Access Control
  • Running database processes under a dedicated non-privileged account
  • Password Security

Principle of Least Privilege

Do not give accounts privileges that aren’t needed.
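
A minimal MySQL sketch (the account and database names are made up): a reporting account is given read-only access to a single database instead of broad privileges.

-- create a reporting account and grant only what it needs
CREATE USER 'report_user'@'localhost' IDENTIFIED BY 'a-strong-password';
GRANT SELECT ON salesdb.* TO 'report_user'@'localhost';
-- no INSERT, UPDATE, DELETE, or DROP privileges are granted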

Firewalling/Access Control

  • Throttling connections
    • Use firewall software like IPTables
    • Don’t let the world connect to your database server

Access Control in MySQL

To disable remote access to the MySQL server, add the following statement to the [mysqld] section of my.cnf (Linux/Mac) or my.ini (Windows):

skip-networking

Access Control to Error Log:

Ensure only “root” and “mysql” have access to the logfile <hostname>.err. The file is stored in the mysql data directory. This file contains very sensitive information such as passwords, addresses, table names, stored procedure names and code parts.

Password Security

  • Change password on a regular basis
  • Don’t use/reuse the same password
  • Categorize your password
  • Don’t use the same password for Gmail and banking
  • Avoid using words found in a dictionary

Make password more secure:

  • Make password long and complex
    • At least 8 characters
    • Use uppercase/lowercase, numbers, and special characters
  • Don't reuse passwords with minor iterations (e.g. password1, password2)
  • Avoid using personal information
  • Never store passwords in obvious place
  • Don’t go to a secure site using public PC
  • Never give password to anybody
    • Aware of social engineering

Password Security in MySQL

To rename the root account in the MySQL console (after logging in as root):

USE mysql;
UPDATE user SET user = 'admin' WHERE user = 'root';
UPDATE user SET password = PASSWORD('new password') WHERE user = 'admin';
FLUSH PRIVILEGES;

Restart the server after changing the root username and password; you can then log in as the admin user with full privileges.

The “LOCAL INFILE” command can be used to gain access to other files on the operating system, for instance “/etc/passwd” in Linux or C:\Windows\win.ini in Windows.

To disable "LOCAL INFILE”, add the following parameter to the [mysqld] section of the MySQL configuration file (my.ini):

local-infile=0

SQL Injection

SQL injection is a technique where malicious users can inject SQL commands into SQL statements via web page input.

A user input is used as part of the SQL statement.

  • SQL injection is a code injection technique that might destroy your database.
  • SQL injection is one of the most common web hacking techniques.
  • SQL injection is the placement of malicious code in SQL statements, via web page input.

Hacking with SQL Injection

  • Exploit an unsanitized field
  • Complete the single quote and closing bracket
  • Make use of the statement terminator ; to add an extra SQL statement
    • Could be catastrophic, resulting in a trapdoor
  • Make use of the comment symbols #, --, or /* */
  • Inject OR 1=1 for numeric input, since 1=1 is always true
  • Inject " or ""=" for string input, since ""="" is always true

Note:

Comment Syntax available in MySQL:

# This is a MySQL comment
-- This is a standard SQL comment
/* This is a multiline comment */

Injection with commenting

Example:

Original Code:

SELECT * from UserTbl WHERE username = 'fred' and password = 'xyz';

Where fred and xyz are the inputs.

Injection Input:

When a hacker knows the username of a user, they can put the following string into the username field.

fred'--

Injected Code:

SELECT * FROM UserTbl WHERE username ='fred'--' and password = '';

-- This query is equivalent to:
SELECT * FROM UserTbl WHERE username = 'fred';

Injection based on 1=1 is Always True

Example:

Original Code:

SELECT ProdName FROM Product WHERE Stock=12;

Where 12 is the input.

Injection Input:

12 or 1=1

Injected Code:

SELECT ProdName FROM Product WHERE Stock=12 or 1=1;

-- This query is equivalent to:
SELECT ProdName FROM Product;

Injection using the statement terminator ; to add an extra SQL statement

Example:

Original Code:

SELECT ProdName FROM Product WHERE Stock=12;

Where 12 is the input.

Injection Input:

12 or 1=1; SELECT * from Product;

Injected Code:

SELECT ProdName FROM Product WHERE Stock=12 or 1=1; SELECT * from Product;

-- This query is equivalent to:
SELECT * from Product;

Injection based on ""="" is Always True

Example:

Original Code:

SELECT * FROM Customer WHERE FirstName = "Mary" AND LastName = "Hill";

Where Mary and Hill are the inputs.

Injection Input:

" or ""="

Injected Code:

SELECT * FROM Customer WHERE FirstName = "" or ""="" AND LastName = "" or ""="";

-- This query is equivalent to:
SELECT * FROM Customer;

Preventing SQL Injection

To prevent SQL injection, developers should

  • stop writing dynamic queries (Impossible)
  • prevent user supplied input which contains malicious SQL from affecting the logic of the executed query.

Prepared statements

It is a Defense Technique.

Key point: use parameterized queries.

Use the user input as a parameter.

Prepared statements ensure that an attacker is not able to change the intent of a query, even if SQL commands are inserted by an attacker.

A small example in Java to prevent SQL injection:

String custNo = request.getParameter("CustNo");
String query = "SELECT * FROM Customer WHERE CustNo = ?"; // ? indicates the parameter
PreparedStatement pstmt = conn.prepareStatement(query);
pstmt.setString(1, custNo); // bind the first parameter to custNo
ResultSet res = pstmt.executeQuery();

The problem with SQL injection is that user input is used as part of the SQL statement. By using prepared statements, you force the user input to be handled as the content of a parameter (and not as part of the SQL command).

But if you don't use the user input as a parameter for your prepared statement, and instead build your SQL command by joining strings together, you are still vulnerable to SQL injection even when using prepared statements.
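
Parameterization is not limited to application code. As a hedged sketch reusing the Customer/CustNo names from the Java example above, MySQL's server-side prepared statements also treat the bound value purely as data:

PREPARE stmt FROM 'SELECT * FROM Customer WHERE CustNo = ?';
SET @custNo = '12345';        -- the value is bound as data, never spliced into the SQL text
EXECUTE stmt USING @custNo;
DEALLOCATE PREPARE stmt;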

Big Data

Big Data relates to data creation, storage, retrieval and analysis that is remarkable in terms of Volume, Velocity, and Variety. (3Vs of Big Data)

Big Data has multiple definitions. 3Vs, 5Vs, and even 42Vs

5Vs of Big Data

Volume

  • the incredible amounts of data generated each second
  • constantly-updated data feeds containing information
  • Terabytes to exabytes of existing data to process

Velocity

  • the speed at which vast amounts of data are being generated, collected and analyzed
  • generate massive log data in real-time
  • Streaming data, milliseconds to seconds to respond

Variety

  • the different types of data
  • not just numbers and strings, but even 3D data, audio and video
  • Structured, unstructured, text, multimedia

Veracity

  • the quality or trustworthiness of the data
  • Accuracy of data

Value

  • the worth of the data being extracted

Source of Big Data

  • Social media and networks
    • all of us are generating data
  • Scientific instruments
    • collecting all sorts of data
  • Mobile devices
    • tracking all objects all the time
  • Sensor technology and networks
    • Measuring all kinds of data

Innovation is no longer hindered by the ability to collect data but by the ability to manage, analyze, summarize, visualize, and discover knowledge from data in a timely and scalable manner.

Storing Big Data

Traditional SQL vs Unstructured

Traditional SQL Databases

Traditional SQL databases consist of relational tables with a fixed schema.

ACID Rules of Traditional SQL Databases (a transaction sketch follows the list):

  • Atomic
    • A transaction is a logical unit of work which must be either completed with all of its data modifications, or none of them is performed.
  • Consistent
    • At the end of the transaction, all data must be left in a consistent state
  • Isolated
    • Modifications of data performed by a transaction must be independent of another transaction.
  • Durable
    • When the transaction is completed, effects of the modifications performed by the transaction must be permanent in the system.
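
A minimal SQL sketch of atomicity (hypothetical Account table): both updates take effect together at COMMIT, or neither does if the transaction is rolled back.

START TRANSACTION;
UPDATE Account SET balance = balance - 100 WHERE accNo = 1;
UPDATE Account SET balance = balance + 100 WHERE accNo = 2;
COMMIT;  -- all modifications become permanent together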

Disadvantage of Traditional SQL Databases:

  • very complicated to Join tables that are distributed across multiple machines
  • Difficult to deploy a big database on a distributed network
    • because dividing a relational database across multiple machines is nontrivial.
  • Maintaining a state of data consistency can become tricky on a distributed database
  • not designed to store a huge amount of data
  • when data volume grows, reading from and writing to a relational DB while maintaining ACID compliance can be computationally expensive

Structured Database (SQL Databases)

  • can be easily organized, stored and updated in a defined data model
  • e.g. numbers/text
  • have a consistent format (schema)
  • data can be processed, searched, queried, combined (joined)
    • analyzed straightforwardly using SQL

Unstructured Databases

  • do not have a defined data model or common format
  • massive datasets consist of unstructured data

Why need it:

  • Too much data
  • growing fast data volume
  • not consistent
  • critical time

To tackle the above problems, unstructured databases and NoSQL were introduced.

NoSQL (Not only SQL)

  • Non-relational database management systems (no JOIN operation)
  • Designed for storing large-scale unstructured data across multiple servers (nodes)
  • Type of data storing may not require fixed schema
  • Used by many IT giants: Google, Facebook, LinkedIn, etc.
  • A database infrastructure well-adapted to big data, easy to scale out

We need NoSQL because of the huge data growth.

NoSQL vs RDBMS

NoSQL vs SQL

| NoSQL | SQL (RDBMS) |
| --- | --- |
| If most data is correct, everything is OK | Integrity is mission-critical |
| Data format unknown or inconsistent | Data format consistent and well-defined |
| Data will be replaced | Data is of long-term value |
| Write once, read multiple times | Frequent data updates |
| Unpredictable / exponential growth | Predictable, linear growth |
| Only programmers writing queries | Non-programmers writing queries |
| Replication | Regular backup |
| Sharding | Access through master server |
| Key-value pair storage, document store | Data stored in separate tables |
| ACID not guaranteed | Follows the ACID rules |
| Highly scalable | Not easy to scale |

5 Advantages of NoSQL

  • Elastic Scaling
  • Good for Big Data
  • Require Less Management
  • Economics
  • Flexible Data Models

Key/Value Stores

A key-value store looks like a big hash table in which each record in the table is an object identified by a unique key.

  • The system does not know anything about the data being stored.
  • records are addressed by unique keys.
  • retrieve information based on a single key.

Examples: LinkedIn’s Voldemort, Apache Cassandra

Document Stores

Similar to key-value stores but with a major difference:

  • values are documents

Document stores have flexible schema in that any number of fields can be added to a document without affecting the other documents.

Unlike key-value stores, where queries are only allowed against the keys, document stores allow queries against the contents of documents.

Example implementation: CouchDB, MongoDB

MapReduce

A programming framework for writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

  • An Algorithm or concept to process Huge amount of data in a faster way
    • Divide it, then Map and Reduce
  • MapReduce opts for scaling out rather than scaling up

    • Scale Out: Adding more low-end servers as need.

    • Scale Up: Replace the existing server by a bigger (more expensive) one when demand is high

      Scale out is much cheaper

MapReduce provides

  • Automatic parallelization and distribution
  • Fault-tolerance
  • Status and monitoring

MapReduce tasks

MapReduce applications have two separate and distinct tasks:

Map and Reduce.

  • Mapper
    • Takes input data and processes it to produce key/value pairs
    • Each mapper works on its own portion of the input files

The key/value pairs are then fed to the Reducer.

  • Reducer
    • Takes those key/value pairs and combines or aggregates them to produce the final result (typically a single value per key).

Hadoop

Hadoop is an implementation of the MapReduce framework.

  • Hadoop is open source, but Google MapReduce is not.

Hadoop is implemented in Java.

It allows programmers to write parallel programs that execute on a cluster with thousands of nodes (servers).

Hadoop consists of :

  • Hadoop Common package, which provides file system and OS level abstractions;
  • a MapReduce engine
  • the Hadoop Distributed File System (HDFS).

Hadoop Distributed File System (HDFS)

  • Files are split into 64 MB or 128 MB blocks
  • Blocks are replicated across nodes
  • Namenodes (masters) store filenames and block locations
  • Datanodes (workers) store the data blocks

Flow of Data in MapReduce

  1. Several mappers read the input from HDFS (discussed later), each taking a portion
  2. The mappers apply the map function and write the output to disk.
  3. Optionally, combiners partially reduce mapper output.
  4. The intermediate data is transferred across the network from mappers to reducers and stored to disk.
  5. Each reducer groups intermediate values together by keys
  6. The reducers read the sorted data from disk and apply the reduce function
  7. Output is written to disk

Hadoop vs RDBMS

|  | Hadoop | Traditional RDBMS |
| --- | --- | --- |
| Data size | Petabytes (10^15 bytes, or 1024^5) | Gigabytes |
| Access | Batch | Interactive and batch |
| Update | Write once, read many times | Read/write many times |
| Data Type | Unstructured (dynamic schema) | Structured (static schema) |
| Integrity | Low | High |
| Scaling | Scale out | Scale up |

Hadoop:

  • Affordable Storage / Computer
  • Structured or Unstructured
  • Resilient Auto Scalability

Relational Databases:

  • Interactive response times
  • ACID
  • Structured data
  • Cost/Scale prohibitive

NoSQL on Hadoop

  • NoSQL databases can be built on top of a Hadoop cluster.
  • HBase is a NoSQL database that runs on top of a Hadoop cluster and provides random real-time read/write access to data.
  • When you want to randomly access key-value stores, use HBase.
  • When you want to batch process a large volume of data in parallel, use Hadoop, which provides HDFS and implements the MapReduce processing framework.