Database System [3] - Data Warehousing and Big Data
Normalization
ER modelling is one database design technique;
normalization is another, complementary technique.
Purpose of Normalization
- begins by examining the relationships (called functional dependencies) between attributes.
- uses a series of tests (called normal forms) to identify the optimal grouping of the attributes.
- Remove repeating groups
- Remove unwanted functional dependencies
Characteristics of a suitable set of relations include
- attributes with a close logical relationship are found in the same relation;
- minimal redundancy: each attribute is represented only once (except for the foreign keys).
Benefits of having a normalized database with well-designed relations
- easier for users to access and maintain the data
- less data duplication -> takes up minimal storage space -> minimizes costs
- reducing the opportunities for data inconsistencies
How Normalization Supports Database Design
Data Redundancy
Major aim of relational database design is to group attributes into relations to minimize data redundancy.
Update Anomalies
Relations that contain redundant information may potentially suffer from update anomalies.
Insertion anomaly
- Cannot insert a record with a null primary key.
- Have to enter duplicate data
- Problems with missing data
- Cannot insert a value which is not in a table column
- Adding a non-key value requires adding a primary-key value as well
Modification anomaly
- Changing the value of a duplicated attribute requires changing it everywhere it appears.
- Multiple rows of a table may need to be changed when a single value is updated
- An update of one attribute may have to be carried out on several tuples if the information is duplicated
Deletion anomaly
- Deleting a row that represents the last occurrence of a particular value also removes all information about that value from the database.
- Deletion of one element of data may cause other data (including primary-key values) to be lost
- E.g. deleting the row for the last member of staff at a branch also removes all information about that branch from the database
Example
Anomalies occur when a table contains facts about two or more different themes.
That's why we need normalization.
- Every normalized relation has a single theme
- Normalization breaks up relations so that each has a single theme
- The relations resulting from normalization show a high degree of data independence.
Functional Dependencies (FD)
Functional dependency (FD) is an important concept associated with normalization.
FDs are used to identify the primary key for a relation.
Definition of Functional dependency (FD)
- Assume that A and B are attributes of a relation
- if each value of A is associated with exactly one value of B,
- B is functionally dependent on A (denoted A → B)
- A is the determinant of B
- There must be a one-to-one relationship between each value of A and its value of B
- A one-to-many relationship is acceptable when the "one" side is the primary key
Partial Dependencies (PD)
If B is functionally dependent on a proper subset of a composite determinant A, this is called a partial dependency.
E.g. if {A,B} → {C} but also {A} → {C}, then {C} is partially functionally dependent on {A,B}.
e.g. {staffNo, sName} -> {branchNo}
but also {staffNo} -> {branchNo}
Therefore {branchNo} is partially functionally dependent on {staffNo, sName}
Characteristics of Functional Dependencies
- There is a one-to-one relationship between the attribute(s) on the left-hand side (determinant) and those on the right-hand side of a functional dependency.
- Holds for all time.
- The determinant has the minimal number of attributes necessary to maintain the dependency with the attribute(s) on the right hand-side
Transitive Dependencies
- describes a condition where A, B, and C are attributes of a relation
- such that if A → B and B → C, then C is transitively dependent on A via B
- A→B, B→C, then A→C
- e.g. orderNo and supplierNo and supplier-name
- orderNo -> supplierNo, supplierNo-> supplier-name, Then orderNo -> supplier-name
It is important to recognize a transitive dependency
because its existence in a relation can potentially cause update anomalies
The Process of Normalization
Normalization is a formal technique for analyzing a relation based on its primary key and the functional dependencies between the attributes of that relation. Often executed as a series of steps. Each step corresponds to a specific normal form, which has known properties.
As normalization proceeds, the relations become progressively more restricted (stronger) in format and also less vulnerable to update anomalies.
- resulting files show a high degree of data independence.
Unnormalized Form (UNF)
Basically: There are multivalued attributes or repeating groups
A table that contains one or more repeating groups.
Example:
Normalization : 1NF to 3NF
First Normal Form (1NF)
non-keys depend on key
Removed all repeating groups
- Remove repeating groups by entering appropriate data into the empty rows.
Basically: No multivalued attributes or repeating groups
the intersection of each row and column contains one and only one value.
Example:
Second Normal Form (2NF)
non-keys depend on the whole key
- Removed partial dependencies
Basically: 1NF + No partial dependencies
1NF + each non-key attribute must be fully dependent on the whole primary key
Achieved by splitting the table
To convert 1NF to 2NF:
- Identify Primary key for the 1NF relation
- Identify the functional dependencies in the relation
- If partial dependencies exist on the primary key, remove them by placing them in a new relation along with a copy of their determinant. (Split the table.)
Example:
1NF form: {SalespersonNo, ProdNo} -> {everything}
Remove the partial dependencies from the table and form new tables according to the part of the primary key they depend on.
Note that you need to keep a copy of the primary-key attributes in the original table, and then rename the original table.
Now you have 3 tables, and the design is in 2NF with no partial dependencies; a rough sketch follows below.
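As a rough sketch of the split (only SalespersonNo, ProdNo, DeptNo, and ManagerName appear in the notes; the other table and column names are assumed for illustration):

```sql
-- Attributes that depend on SalespersonNo only
CREATE TABLE Salesperson (
  SalespersonNo   INT PRIMARY KEY,
  SalespersonName VARCHAR(50),
  DeptNo          INT,
  ManagerName     VARCHAR(50)   -- still here in 2NF; moved out at 3NF (see below)
);

-- Attributes that depend on ProdNo only
CREATE TABLE Product (
  ProdNo   INT PRIMARY KEY,
  ProdName VARCHAR(50)
);

-- Attributes that depend on the whole composite key
CREATE TABLE Sale (
  SalespersonNo INT,
  ProdNo        INT,
  Quantity      INT,
  PRIMARY KEY (SalespersonNo, ProdNo),
  FOREIGN KEY (SalespersonNo) REFERENCES Salesperson(SalespersonNo),
  FOREIGN KEY (ProdNo)        REFERENCES Product(ProdNo)
);
```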
Third Normal Form (3NF)
non-keys depend on the whole key, nothing but the key
- Removed transitive dependencies
Basically: 2NF + No transitive dependencies (split tables)
1NF + 2NF + none of the non-primary-key attributes is transitively dependent on the primary key
- No (non-key→non-key) via transitive dependence
To convert 2NF to 3NF:
- Identify Primary key for the 2NF relation
- Identify the functional dependencies in the relation
- If transitive dependencies exist on the primary key remove them by placing them in a new relation along with a copy of their determinants. (Split Table)
Example:
Take the previous 2NF example.
We can see there is a non-key -> non-key dependency in the Salesperson table:
{SalespersonNo} -> {DeptNo} , {DeptNo} -> {ManagerName}
Therefore {SalespersonNo} -> {ManagerName} => Transitively dependent
This violates 3NF.
Move the transitively dependent attribute into a new table, keeping a copy of its determinant (DeptNo) as a foreign key in the original table; a sketch follows below.
Now you have 4 Tables. It is in 3NF without transitive dependencies.
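A sketch of the additional split (column names beyond those mentioned in the example are assumed):

```sql
-- ManagerName depends on DeptNo, not directly on SalespersonNo, so it moves to its own table
CREATE TABLE Department (
  DeptNo      INT PRIMARY KEY,
  ManagerName VARCHAR(50)
);

-- Salesperson keeps DeptNo as a foreign key to Department
CREATE TABLE Salesperson (
  SalespersonNo   INT PRIMARY KEY,
  SalespersonName VARCHAR(50),
  DeptNo          INT,
  FOREIGN KEY (DeptNo) REFERENCES Department(DeptNo)
);
```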
Data Warehousing
To set up a data warehouse, you take all the data, make it consistent, and put it in one location in a suitable format.
NOT IN 3NF (Big Data)
De-normalising Data
- The data should be de-normalised to 2NF.
- Data redundancy occurs in 2NF
- You need more storage
Why?
- You can get the data more quickly
- The purpose of a data warehouse is to provide aggregate data which is in a suitable format for decision making.
Business Intelligence (BI)
BI refers to the processes for
- collecting and analysing data
- with the purpose of facilitating corporate decision making.
In order to make the business better.
Data warehousing is one of the key technologies for implementing BI.
How?
BI helps enterprise users make better business decisions by:
- Understanding the health of the organization
- The whats and hows
- Collaborating on the same view of data and business logic
- Reducing the time to make decision (automation)
Challenges in Delivery
- Large volume of related data
- Variety of data formats
- Location of data
- Data are not easily queried
- Data are not optimized for analytical queries
- Data are not accessible
- Data definitions are not consistent
- Users may not have the skills, tools, or permission to query the data
Ideally, BI is
- Empowerment – directly usable
- Fast - responsive
- Timely - available
- Accurate – trusted with quality
- Usable – has value
Relationship between BI and DW
OLTP VS OLAP
Online transaction processing (OLTP) | Online analytical processing (OLAP) |
---|---|
designed to maximize transaction throughput | designed to hold historical data for analyses |
Data stored in normalized format | Data stored in denormalized format |
DB size is small (around 100MB to 100GB) | DB size is huge (around 100GB to few TB) |
Used by normal users | Used by Decision maker |
Slow query response if amount of data is very large | Quicker query response, generate quicker reports |
Data Warehousing Concepts
A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993).
The Goal of Data Warehousing
To integrate corporate data into a single repository from which users can easily run queries, produce reports, and perform analysis (for making business decisions).
Subject-Oriented
- warehouse is organized around the major subjects of the enterprise
- not organized around the business operations
- This is reflected in the need to store decision-support data rather than application-oriented data.
Integrated Data
- integrates corporate application-oriented data from different source systems
- often includes data that is inconsistent (different format)
- integrated data source must be made consistent
- to present a unified view of the data to the users
Time Variant
- data in a data warehouse is identified with a particular time period
- data in data warehouse provide information from a historical point of view
Non-volatile Data
- Data in the warehouse is not normally updated in real-time (RT)
- It is refreshed from operational systems on a regular basis
- New data is always added as a supplement to the database, rather than a replacement.
However, emerging trend is towards real-time or near real-time DWs
Benefits of Data Warehousing
- Potential high returns on investment (ROI)
- Competitive advantage
- Better business decisions
- Reveal of untapped information
- Increased productivity of corporate decision-makers
- allows corporate decision makers to perform more accurate and consistent analysis
OLTP VS Data Warehousing
- DBMS built for online transaction processing (OLTP)
- usually not suitable for data warehousing.
OLTP is designed to maximize transaction throughput.
DW is designed to support ad hoc queries for business decision making.
Therefore:
- OLTP is for making transaction.
- DW is for making decision.
| | OLTP | Data Warehousing |
|---|---|---|
Main purpose | Support Operational processing | Support analytical processing |
Data age | Current | Historic |
Data latency | Real-time | Depends on length of cycle for data supplements to warehouse |
Data granularity | Detailed data | Detailed data, lightly and highly summarized data |
Data processing | Predictable pattern | Less Predictable pattern |
Reporting | Predictable, relatively static fixed reporting | Unpredictable, dynamic reporting |
Users | Serves large number of operational users | Serves lower number of managerial users |
Data Warehouse Architecture
Operational Data Sources
- Main sources are online transaction processing (OLTP) databases.
- include sources such as personal databases, spreadsheets, and web usage log files.
- Holds current and integrated operational data for analysis
ETL Manager
ETL stands for extraction, transformation, and loading.
The ETL manager does the extraction and loading of data into the warehouse.
- Data for a DW must be extracted from one or more data sources
- transformed into a form that is easy to analyze
- consistent with data already in the warehouse
- then finally loaded into the DW
Warehouse Manager
The warehouse manager manages the data held in the warehouse.
- Analysis of data to ensure consistency
- Transformation and merging of source data from temporary storage into data warehouse tables
- Creation of indexes and views on base tables
- Generation of denormalizations if necessary
- Generation of aggregations if necessary (see the sketch below)
- Backing-up and archiving data
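For example, the "creation of indexes" and "generation of aggregations" duties above might look like this in SQL (a hypothetical sketch; the table and column names are made up):

```sql
-- Hypothetical aggregation generated by the warehouse manager
CREATE TABLE monthly_sales_summary AS
SELECT product_id,
       DATE_FORMAT(sale_date, '%Y-%m') AS sale_month,
       SUM(amount) AS total_amount
FROM sales_fact
GROUP BY product_id, DATE_FORMAT(sale_date, '%Y-%m');

-- Index created to speed up queries on the base fact table
CREATE INDEX idx_sales_fact_date ON sales_fact (sale_date);
```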
Query Manager
The query manager manages user queries.
- Directs queries to the appropriate tables and schedules the execution of queries
- Generates query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate for speeding up future queries
End-User Access Tools
End-user access tools include:
- Traditional reporting and query
- OLAP
- Data mining
Main purpose of DW is to support decision makers.
This is achieved through the provision of a range of access tools including:
- reporting and querying,
- application and development,
- Online Analytical Processing (OLAP)
- Data mining
ETL and Data Marts
ETL (Extraction, Transformation and Loading)
- E - Get the Data
- T - Make it useful
- L - Save it to the warehouse
Extraction
- Targets one or more data sources
- typically include OLTP databases
- but can also include personal databases and spreadsheets,
- Enterprise Resource Planning (ERP) files, and web usage log files.
- data sources are normally internal
- but can also include external sources
- such as the systems used by suppliers and/or customers
Transformation
- Applies a series of rules or functions to the extracted data
- determines how the data will be used for analysis
- can involve transformations such as
- data summations
- data encoding
- data merging
- data splitting
- data calculations
- creation of surrogate keys.
Loading
Data Loading involves:
- sorting
- summarizing
- consolidating
- checking integrity
- building indices
- building partitions (for speeding up queries); a small transform-and-load sketch follows below.
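As a rough illustration of the transformation and loading steps, a single transform-and-load statement might look like this (the staging and warehouse table names and columns are hypothetical):

```sql
-- Hypothetical ETL step: summarize extracted staging rows and load them into the warehouse
INSERT INTO dw_sales_fact (product_id, branch_id, sale_date, total_amount)
SELECT product_id,
       branch_id,
       sale_date,
       SUM(amount)               -- data summation during transformation
FROM staging_sales
GROUP BY product_id, branch_id, sale_date;
```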
Data Marts (Sub-sets of the Data Warehouse)
- Don’t mess with my data.
- Keep it simple for the user.
- Small problems are easier to solve.
Usage of Data Marts
- give users access to the data they need to analyze most often
- provide data in a form that matches the collective view of the data by a group of users in a department or business application area
- improve end-user response time due to the reduction in the volume of data to be accessed
- provide appropriately structured data as dictated by the requirements of the end-user access tools
- Building a data mart is simpler compared with establishing an enterprise-wide DW (EDW)
- The cost of implementing data marts is normally less than that required to establish an EDW
- The future users of a data mart are more easily defined and targeted to obtain support for a data mart than an enterprise-wide data warehouse project
Data Marts vs Data Warehouse
| | Data Warehouse | Data Mart |
|---|---|---|
Scope | Corporate | Line of Business (LOB) |
Subject | Multiple | Single subject |
Data Sources | Many | Few |
Size (typical) | 100 GB-TB+ | < 100 GB |
Implementation Time | Months to years | Months |
Data Warehousing Design
Design the Dimensional Model (star schema)
The design is to provide instantaneous query results for analysts.
A dimensional model has:
- 1 fact table
- In the middle of the star schema is the fact table.
- The fact table contains the measures
- a number of dimension tables
- Dimension tables contain textual descriptions (see the schema sketch below)
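A minimal star-schema sketch (the subject area, table names, and columns are all assumed for illustration):

```sql
-- Dimension tables: textual descriptions of products and dates
CREATE TABLE dim_product (
  product_key  INT PRIMARY KEY,
  product_name VARCHAR(100),
  category     VARCHAR(50)
);

CREATE TABLE dim_date (
  date_key  INT PRIMARY KEY,
  full_date DATE,
  year      INT,
  month     INT
);

-- Fact table at the centre of the star: measures plus foreign keys to the dimensions
CREATE TABLE fact_sales (
  product_key INT,
  date_key    INT,
  quantity    INT,
  amount      DECIMAL(10,2),
  FOREIGN KEY (product_key) REFERENCES dim_product(product_key),
  FOREIGN KEY (date_key)    REFERENCES dim_date(date_key)
);
```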
OLAP Cube
An OLAP cube stores data in an optimized way to provide a quick response to various types of complex queries by using dimensions and measures.
- Once the cube is created, we may perform queries on it instead of on the original data sources.
Example:
How to Design an OLAP Cube
- Identify the attributes first (usually 3) -> these determine the dimensions
- Identify how many distinct values each attribute has -> this determines the length of each dimension (see the query sketch below)
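As an illustration, suppose we pick three dimensions (year, country, product) and one measure (SUM(profit)) from a hypothetical sales table (these column names are assumed, not taken from the notes). One layer of the cube can then be computed with a grouped query:

```sql
-- Dimensions: year, country, product; measure: SUM(profit)
-- Each distinct value of a dimension column becomes one position along that dimension.
SELECT year, country, product, SUM(profit) AS total_profit
FROM sales
GROUP BY year, country, product;
```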
MySQL: WITH ROLLUP
The SQL WITH ROLLUP modifier adds summary rows to the output: with a single grouping column it adds one extra row representing a higher-level (grand-total) summary, and with multiple grouping columns it adds multiple summary rows, one set per level of aggregation.
Example:
Let's say we have a table sales:

```sql
-- Column list assumed; the notes only show the first line of this statement
CREATE TABLE sales (year INT, country VARCHAR(20), product VARCHAR(32), profit INT);
```
With a single grouping column, WITH ROLLUP adds one extra row that represents the higher-level summary:

```sql
SELECT year, SUM(profit) AS profit
FROM sales
GROUP BY year WITH ROLLUP;
```

Output:
With multiple grouping columns, WITH ROLLUP produces multiple summary rows, one for each level of aggregation:

```sql
-- Extra grouping columns assumed; the notes truncate this query
SELECT year, country, product, SUM(profit) AS profit
FROM sales
GROUP BY year, country, product WITH ROLLUP;
```

Output:
Database Security
What Is Database Security
The mechanisms that protect the database against intentional or accidental threats.
Database security encompasses:
- hardware
- software
- people
- data
Why Care About Database Security
A database represents an essential corporate resource that should be properly secured using appropriate controls.
We consider database security in relation to the following situations:
- theft and fraud
- loss of confidentiality (secrecy)
- loss of privacy
- loss of integrity
- loss of availability
Theft and fraud
- affect not only the database environment but also the entire organization
- attention should focus on reducing the opportunities for this occurring
Loss of Confidentiality and Privacy
- privacy refers to the need to protect data about individuals
- loss of confidentiality could lead to loss of competitiveness
- loss of privacy could lead to legal action being taken against the organization
Loss of Integrity and Availability
- Loss of data integrity results in invalid or corrupted data (data corruption)
- seriously affect the operation of an organization
- Loss of availability means that the data, or the system, or both cannot be accessed
- seriously affect an organization’s financial performance.
Threats
An organization should invest time and effort in identifying the most serious threats.
And then initiate appropriate plans and countermeasures.
Example of Threats:
Potential threats to computer systems
Data Security Best Practices
- Principle of Least Privilege
- Stay up-to-date on patches
- Remove/disable unneeded default accounts
- Firewalling/Access Control
- Running database processes under a dedicated non-privileged account
- Password Security
Principle of Least Privilege
Do not give accounts privileges that aren’t needed.
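For instance, in MySQL an application account can be restricted to just the privileges and schema it needs. A minimal sketch (the account name, host range, database name, and password are placeholders, not from the notes):

```sql
-- Hypothetical least-privilege account: it can read and modify rows in one schema only,
-- and has no DROP, GRANT, FILE, or administrative privileges.
CREATE USER 'app_user'@'10.0.0.%' IDENTIFIED BY 'a-strong-password';
GRANT SELECT, INSERT, UPDATE, DELETE ON appdb.* TO 'app_user'@'10.0.0.%';
```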
Firewalling/Access Control
- Throttling connections
- Use firewall software like IPTables
- Don’t let the world connect to your database server
Access Control in MySQL
To disable remote access to the MySQL server, add the following statement to the [mysqld] section of my.cnf (Linux/Mac) or my.ini (Windows):

```ini
skip-networking
```
Access Control to Error Log:
Ensure that only "root" and "mysql" have access to the logfile <hostname>.err. The file is stored in the MySQL data directory and contains very sensitive information such as passwords, addresses, table names, stored procedure names, and code parts.
Password Security
- Change password on a regular basis
- Don’t use/reuse the same password
- Categorize your password
- Don’t use the same password for Gmail and banking
- Avoid using words found in a dictionary
Make password more secure:
- Make password long and complex
- At least 8 characters
- Use uppercase/lowercase, numbers, and special characters
- Don’t use/reuse password iteration
- Avoid using personal information
- Never store passwords in obvious place
- Don’t go to a secure site using public PC
- Never give password to anybody
- Be aware of social engineering
Password Security in MySQL
To rename the root account in the MySQL console (after logging in as root):

```sql
USE mysql;
```
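A minimal sketch of the rename and password change (account names and the password are placeholders; the ALTER USER syntax assumes MySQL 5.7 or later):

```sql
-- Hypothetical completion: rename the root account, then set a new password
UPDATE user SET user = 'admin' WHERE user = 'root';
FLUSH PRIVILEGES;
ALTER USER 'admin'@'localhost' IDENTIFIED BY 'a-strong-password';
```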
Restart server after changing root username and password and you can login as admin user with full privilege.
The “LOCAL INFILE” command can be used to gain access to other files on the operating system, for instance “/etc/passwd” in Linux or C:\Windows\win.ini in Windows.
To disable "LOCAL INFILE”, add the following parameter to the [mysqld] section of the MySQL configuration file (my.ini):
```ini
set-variable=local-infile=0
```
SQL Injection
SQL injection is a technique whereby malicious users can inject SQL commands into SQL statements via web page input.
A user input is used as part of the SQL statement.
- SQL injection is a code injection technique that might destroy your database.
- SQL injection is one of the most common web hacking techniques.
- SQL injection is the placement of malicious code in SQL statements, via web page input.
Hacking with SQL Injection
- Exploit an unsanitized input field
- Complete the single quote and closing bracket
- Make use of the end symbol `;` to add an extra SQL statement
- Could be catastrophic, resulting in a trapdoor
- Make use of a comment symbol: `#`, `--`, or `/* */`
- Could inject `OR 1=1` for numeric input, because `1=1` is always true
- Could inject `" or ""="` for string input, because `""=""` is always true
Note:
Comment syntax available in MySQL:

```sql
# This is a MySQL comment
-- This is a standard SQL comment
/* This is a multiline comment */
```
Injection with commenting
Example:
Original Code:
```sql
SELECT * FROM UserTbl WHERE username = 'fred' AND password = 'xyz';
```

where `fred` and `xyz` are the inputs.
Injection Input:
When a hacker knows the username of a user, they can put the following string in the username field:

```sql
fred'--
```
Injected Code:
```sql
SELECT * FROM UserTbl WHERE username ='fred'--' and password = '';
```
Injection based on 1=1 is Always True
Example:
Original Code:
```sql
SELECT ProdName FROM Product WHERE Stock=12;
```

where `12` is the input.
Injection Input:
```sql
12 or 1=1
```
Injected Code:
```sql
SELECT ProdName FROM Product WHERE Stock=12 or 1=1;
```
Makes use of the end symbol `;` to add an extra SQL statement
Example:
Original Code:
```sql
SELECT ProdName FROM Product WHERE Stock=12;
```

where `12` is the input.
Injection Input:
```sql
12 or 1=1; SELECT * from Product;
```
Injected Code:
```sql
SELECT ProdName FROM Product WHERE Stock=12 or 1=1; SELECT * from Product;
```
Injection based on ""="" is Always True
Example:
Original Code:
```sql
SELECT * FROM Customer WHERE FirstName = "Mary" AND LastName = "Hill";
```

where `Mary` and `Hill` are the inputs.
Injection Input:
```sql
" or ""="
```
Injected Code:
```sql
SELECT * FROM Customer WHERE FirstName = "" or ""="" AND LastName = "" or ""="";
```
Preventing SQL Injection
To prevent SQL injection, developers should
- stop writing dynamic queries (Impossible)
- prevent user supplied input which contains malicious SQL from affecting the logic of the executed query.
Prepared statements
It is a Defense Technique.
Key point: use parameterized queries, i.e. pass the user input as a parameter rather than concatenating it into the SQL string.
Prepared statements ensure that an attacker is not able to change the intent of a query, even if SQL commands are inserted by an attacker.
A small example in Java to prevent SQL injection:

```java
String custNo = request.getParameter("CustNo");
```
The problem with SQL injection is that user input is used as part of the SQL statement. By using prepared statements you can force the user input to be handled as the content of a parameter (and not as part of the SQL command).
But if you don't use the user input as a parameter for your prepared statement and instead build your SQL command by joining strings together, you are still vulnerable to SQL injection even when using prepared statements.
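The same idea can also be shown with MySQL's own server-side prepared statements (reusing the UserTbl example from earlier; the bound values are illustrative):

```sql
-- The query text contains only placeholders; user input is bound as data and never parsed as SQL
PREPARE login_stmt FROM 'SELECT * FROM UserTbl WHERE username = ? AND password = ?';
SET @u = 'fred''--';   -- even a malicious value is handled as a plain string, not as SQL syntax
SET @p = 'xyz';
EXECUTE login_stmt USING @u, @p;
DEALLOCATE PREPARE login_stmt;
```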
Big Data
Big Data relates to data creation, storage, retrieval and analysis that is remarkable in terms of Volume, Velocity, and Variety. (3Vs of Big Data)
Big Data has multiple definitions. 3Vs, 5Vs, and even 42Vs
5Vs of Big Data
Volume
- the incredible amounts of data generated each second
- constantly-updated data feeds containing information
- Terabytes to exabytes of existing data to process
Velocity
- the speed at which vast amounts of data are being generated, collected and analyzed
- generate massive log data in real-time
- Streaming data, milliseconds to seconds to respond
Variety
- the different types of data
- not just numbers and strings, but even 3D data, audio and video
- Structured, unstructured, text, multimedia
Veracity
- the quality or trustworthiness of the data
- Accuracy of data
Value
- the worth of the data being extracted
Source of Big Data
- Social media and networks
- all of us are generating data
- Scientific instruments
- collecting all sorts of data
- Mobile devices
- tracking all objects all the time
- Sensor technology and networks
- Measuring all kinds of data
Innovation is no longer hindered by the ability to collect data but by the ability to manage, analyze, summarize, visualize, and discover knowledge from data in a timely and scalable manner.
Storing Big Data
Traditional SQL vs Unstructured
Traditional SQL Databases
Traditional SQL databases consist of relational tables with a fixed schema.
ACID Rules of Traditional SQL Databases:
- Atomic
- A transaction is a logical unit of work which must be either completed with all of its data modifications, or none of them is performed.
- Consistent
- At the end of the transaction, all data must be left in a consistent state
- Isolated
- Modifications of data performed by a transaction must be independent of another transaction.
- Durable
- When the transaction is completed, effects of the modifications performed by the transaction must be permanent in the system.
Disadvantage of Traditional SQL Databases:
- very complicated to Join tables that are distributed across multiple machines
- Difficult to deploy a big database on a distributed network
- because dividing a relational database across multiple machines is nontrivial.
- Maintaining a state of data consistency can become tricky on a distributed database
- not designed to store a huge amount of data
- when data volume grows, reading from and writing to a relational DB while maintaining ACID compliance can be computationally expensive
Structured Database (SQL Databases)
- can be easily organized, stored and updated in a defined data model
- e.g. numbers/text
- have a consistent format (schema)
- data can be processed, searched, queried, combined (joined)
- analyzed straightforwardly using SQL
Unstructured Databases
- do not have a defined data model or common format
- massive datasets consist of unstructured data
Why we need them:
- Too much data
- Fast-growing data volume
- Inconsistent data formats
- Time-critical processing
To tackle the above problems, unstructured databases and NoSQL are introduced.
NoSQL (Not only SQL)
- Non-relational database management systems (no JOIN operation)
- Designed for storing large-scale unstructured data across multiple servers (nodes)
- Type of data storing may not require fixed schema
- Used by many IT giants: Google, Facebook, LinkedIn, etc.
- A database infrastructure well-adapted to big data, easy to scale out
We need NoSQL because of the huge data growth.
NoSQL vs RDBMS
NoSQL vs SQL
NoSQL | SQL (RDBMS) |
---|---|
if most data is correct everything is ok | Integrity is mission-critical |
data format unknown or inconsistent | data format consistent and well-defined |
data will be replaced | data is of long-term value |
write-once, read multiple | frequent data updates |
Unpredictable / exponential growth | predictable, linear growth |
only programmers writing queries | non-programmers writing queries |
Replication | Regular backup |
Sharding | access through master server |
Key-value pair storage, doc store | data stored in separated tables |
ACID not guaranteed | Follow the ACID rules |
Highly scalable | Not easy to scale |
5 Advantages of NoSQL
- Elastic Scaling
- Good for Big Data
- Require Less Management
- Economics
- Flexible Data Models
Key/Value Stores
A key-value store looks like a big hash table in which each record in the table is an object identified by a unique key.
- The system does not know anything about the data being stored.
- records are addressed by unique keys.
- retrieve information based on a single key.
Examples: LinkedIn’s Voldemort, Apache Cassandra
Document Stores
Similar to key-value stores but with a major difference:
- values are documents
Document stores have flexible schema in that any number of fields can be added to a document without affecting the other documents.
Unlike key-value stores, where only queries against the keys are allowed, document stores allow queries against the contents of the documents.
Example implementation: CouchDB, MongoDB
MapReduce
a programming framework for writing applications that process vast amounts of data (multi-terabyte) in-parallel on a large cluster (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- An Algorithm or concept to process Huge amount of data in a faster way
- Divide it, then Map and Reduce
- MapReduce opts for scaling out rather than scaling up
- Scale out: add more low-end servers as needed
- Scale up: replace the existing server with a bigger (more expensive) one when demand is high
- Scaling out is much cheaper
- MapReduce provides
- Automatic parallelization and distribution
- Fault-tolerance
- Status and monitoring
MapReduce tasks
MapReduce applications have two separate and distinct tasks:
Map and Reduce.
- Mapper
- Takes input data and processes it to produce key/value pairs
- Each mapper works on its own portion of the input files
The key/value pairs are then fed to the Reducer.
- Reducer
- Takes those key/values pairs and then combines or aggregates them to produce the final result (a single value).
Hadoop
Hadoop is an implementation of the MapReduce framework.
- Hadoop is open source, but Google MapReduce is not.
Hadoop is implemented in Java.
It allows programmers to write parallel programs that execute on a cluster with thousands of nodes (servers).
Hadoop consists of :
- Hadoop Common package, which provides file system and OS level abstractions;
- a MapReduce engine
- the Hadoop Distributed File System (HDFS).
Hadoop Distributed File System (HDFS)
- Files split into 64MB or 128MB blocks
- Blocks are replicated across nodes
- Namenodes (masters) store filenames and locations.
- Datanodes (workers) store data blocks
Flow of Data in MapReduce
- Several mappers read the input from HDFS (described above), each taking a portion
- The mappers apply the map function and write the output to disk.
- Optionally, combiners partially reduce mapper output.
- The intermediate data is transferred across the network from mappers to reducers and stored to disk.
- Each reducer groups intermediate values together by keys
- The reducers read the sorted data from disk and apply the reduce function
- Output is written to disk
Hadoop vs RDBMS
| | Hadoop | Traditional RDBMS |
|---|---|---|
Data size | Petabytes (10^15 or 1024^5) | Gigabytes |
Access | Batch | Interactive and batch |
Update | Write once, read many times | Read/Write many times |
Data Type | Unstructured (dynamic schema) | Structured (static schema) |
Integrity | Low | High |
Scaling | Scale out | Scale up |
Hadoop:
- Affordable Storage / Computer
- Structured or Unstructured
- Resilient Auto Scalability
Relational Databases:
- Interactive response times
- ACID
- Structured data
- Cost/Scale prohibitive
NoSQL on Hadoop
- NoSQL databases can be built on top of a Hadoop cluster.
- HBase is a NoSQL database that runs on top of a Hadoop cluster and provides random real-time read/write access to data.
- When you want to randomly access key-value stores, you use HBase
- When you want to batch process a large volume of data in parallel, you use Hadoop which provides HDFS and implements the MapReduce processing framework