Big Data, Hadoop, NoSQL DB - Introduction
Ing. Ľuboš Takáč, PhD.

University of Žilina

November, 2013
Overview
• Big Data

• Hadoop
– HDFS
– Map Reduce Paradigm

• NoSQL Databases
Big Data
• the origin of the term “BIG DATA” is unclear

• there are a lot of definitions,
e.g. “Big data is now almost universally understood to refer to the
realization of greater business intelligence by storing, processing, and
analyzing data that was previously ignored due to the limitations of traditional
data management technologies.” Matt Aslett
Big Data
• can be defined by the (original) 3 Vs
– Volume (a lot of data)
– Variety (variously structured data)
– Velocity (fast processing)
– other Vs
• Veracity (IBM)
• Value (Oracle)
• Etc.
Where are Big Data Generated
Sample of Big Data Use Cases Today
Hadoop
• a new approach to storing and processing distributed data
• open source project based on Google's GFS (Google File System)
and the Map Reduce paradigm
– Google published papers about GFS and Map Reduce in 2003-2004

• an open source community led by Doug Cutting applied these
tools to the open source search engine Nutch
• in 2006 it became a separate project named Hadoop
Different Approach for Data Processing
[Figure: powerful hardware vs. commodity hardware]
HDFS (Hadoop Distributed File System)
• the core part of Hadoop

• open source implementation of Google's GFS (Google File System)
• designed for commodity hardware
• responsible for distributing files throughout the cluster (the connected PCs running Hadoop)

• designed for high throughput rather than low latency
• typical files are gigabytes in size
• files are broken down into blocks (typically 64 MB or 128 MB)

• blocks are replicated (typically 3 replicas)
• rack aware, write once (with append)
• fault tolerant
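As a rough illustration of the block and replica bullets above, the following Python sketch computes how many blocks a file occupies and how much raw cluster storage its replicas consume. The helper name `hdfs_footprint` and the 1 GB example file are assumptions made for illustration; the 128 MB block size and the 3 replicas are the typical values named on the slide. It is plain arithmetic and does not call into Hadoop.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Hypothetical helper for illustration only; it does not talk to HDFS.
    The last block holds only the remaining bytes, so the raw storage is
    the file size multiplied by the replication factor.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("invalid size, block size, or replication factor")
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


if __name__ == "__main__":
    blocks, raw = hdfs_footprint(1024 * MB)  # an assumed 1 GB file
    print(blocks, raw // MB)  # prints: 8 3072
```

With these assumed values, a 1 GB file is split into 8 blocks and takes about 3 GB of raw disk space, which is the price paid for tolerating the loss of individual machines.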
HDFS – Usage Example

• $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg
– (HDFS acts like a virtual folder; after copying, every PC in the cluster can access those files)

• $ bin/hadoop dfs -ls /user/hadoop
– (the virtual folder is accessible via familiar file commands)
Map Reduce Paradigm
• processing of data stored in HDFS
• map task – works locally on one part of the overall data
• reduce task – collects and processes the results of the map tasks
Map Reduce Example “Hello World”

• text files over HDFS
• word count – counting the frequency of words
Map Reduce Example (Code)
Map phase
Reduce Phase
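The code shown on the original slide is not preserved in this text. As a stand-in, here is a minimal Python sketch of the word count example that simulates the map, shuffle and reduce steps in a single process. It is not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are assumptions chosen for illustration, and a real job would run many map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hello World", "Hello Hadoop"]))
    # prints: {'hello': 2, 'world': 1, 'hadoop': 1}
```

Each call to `map_phase` needs only its own line, which is why map tasks can work locally on their part of the data; only the grouped pairs have to travel to the reducers.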
Map Reduce Example (How it works)
Map Reduce Task (Execution)

• $ bin/hadoop jar WordCount.jar /user/hadoop/input_dir /user/hadoop/output_dir

• $ bin/hadoop dfs -cat /user/hadoop/output_dir/part-r-00000
Map Reduce Task – Monitoring & Debugging
• Hadoop has an interactive web interface for watching tasks and
the cluster
• log files
Hadoop Ecosystem
• other tools usable with Hadoop (or made for Hadoop)
Hadoop Ecosystem
• Hadoop (HDFS, Map Reduce Framework)

• Avro (data serialization)
• Chukwa (monitoring large clustered systems)
• Flume (data collection and aggregation)

• HBase (real-time read and write database)
• Hive (data summarization and querying)
• Lucene (text search)
• Pig (programming and query language)
• Sqoop (data transfer between Hadoop and databases)
• Oozie (workflow and job orchestration)
• etc.
Hadoop Distributions
• open source (hard to configure), http://hadoop.apache.org/

• commercial solutions
– debugged ready-made solutions with support
– include proprietary software and hardware

– user friendly interfaces, also in cloud
– IBM
• InfoSphere BigInsights
• Cloudera

– ORACLE
• Exadata
• Exalytics
NoSQL Databases
• SQL – Traditional relational DBMS
• not every data management/analysis problem is best solved
exclusively using a traditional relational DBMS

• NoSQL = No SQL = not using traditional relational DBMS
• NoSQL = not only SQL
• NoSQL is not a substitute for SQL DBMSs and does not
try to replace them
• often used for Big Data
NoSQL Databases
• designed for fast retrieval and appending operations

• no fixed data structures (no predefined schema)
• types
– document store
– graph databases
– key-value store
– etc.

• key-value store (like relational table with two columns, key
and value)
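To make the two-column analogy concrete, here is a minimal in-memory sketch of a key-value store in Python. The class `KeyValueStore` and its methods are hypothetical names used for illustration; real key-value databases add persistence, distribution and replication on top of this basic interface.

```python
class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only).

    Conceptually a table with two columns, key and value, where the
    store looks up by key and does not interpret the value.
    """

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Insert a new pair or overwrite the value stored under key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value stored under key, or default if it is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove key if present; deleting a missing key is not an error."""
        self._data.pop(key, None)


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("user:1", '{"name": "Anna"}')
    print(store.get("user:1"))   # prints: {"name": "Anna"}
    print(store.get("user:2"))   # prints: None
```

Every operation addresses exactly one key, which is what makes this kind of store easy to partition across many machines.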
NoSQL Databases
• advantages
– low latency, high throughput
– highly parallelizable, massive scalability
– simplicity of design, easy to set up

– relaxed consistency => higher performance and availability

• disadvantages
– no declarative query language => more programming
– relaxed consistency => fewer guarantees
– absence of model => data model is inside the application (a big step back)

• examples: MongoDB, Neo4j, Dynamo, HBase, AllegroGraph, Cassandra, etc.
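The first and third disadvantages can be illustrated with a small Python sketch. It assumes a plain key-value store (modelled here by a dict) whose values are opaque JSON strings; the keys, field names and the helper `names_in_city` are all hypothetical. What a relational DBMS would express in one declarative statement has to be written as a scan in application code, and that code has to know the record layout.

```python
import json

# A plain dict stands in for a key-value store; values are opaque JSON strings.
store = {
    "user:1": json.dumps({"name": "Anna", "city": "Zilina"}),
    "user:2": json.dumps({"name": "Boris", "city": "Bratislava"}),
    "user:3": json.dumps({"name": "Cyril", "city": "Zilina"}),
}


def names_in_city(kv, city):
    """Application-side equivalent of the SQL query
        SELECT name FROM users WHERE city = ? ORDER BY name
    The store cannot evaluate this itself, so the code scans every value
    and must know the record layout (the 'name' and 'city' fields).
    """
    names = []
    for key, raw in kv.items():
        if not key.startswith("user:"):
            continue  # skip records of other (assumed) kinds
        record = json.loads(raw)
        if record.get("city") == city:
            names.append(record["name"])
    return sorted(names)


if __name__ == "__main__":
    print(names_in_city(store, "Zilina"))  # prints: ['Anna', 'Cyril']
```

If the record layout changes, every such function has to change with it, which is the sense in which the data model lives inside the application.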
Summary
• Big Data
– typically unstructured, machine-generated data (sensors, applications) with potential value
– often not used before
– volume, variety, velocity => hard to process with traditional technologies

• Hadoop
– open source technology for storing and processing distributed data
– processing Big Data on commodity hardware cluster
– HDFS, Map Reduce (and the other components of Hadoop Ecosystem)

• NoSQL Databases
– not using traditional relational DBMS
– typically key-value stores, easy to set up
– designed for fast retrieval and appending operations
– highly parallelizable
References
•

[1] JP. Dijcks, Oracle: Big Data for the Enterprise, Jan. 2012.

•

[2] Ľ. Takáč, Data Processing over Very Large Databases, PhD thesis, 2013.

•

[3] O. Dolák, Big Data, http://www.systemonline.cz, 2012.

•

[4] P. Zikopoulos, D. Deroos, K. Parasuraman, T. Deutsch, D. Corrigan, J. Giles, Harness the Power of Big Data,
ISBN 978-0-07-180817-0, 2013.

•

[5] http://www.go-globe.com, 2013.

•

[6] Kanik T., Kováč M., NOSQL - Non-Relational Database Systems as the New Generation of DBMS, OSSConf,
2012.

•

[7] http://wiki.apache.org/hadoop, 2013.

•

[8] http://hadoop.apache.org, 2013.

•

[9] L22: SC Report, Map Reduce, The University of Utah

•

[10] http://bigdatauniversity.com, 2013.

•

[11] http://en.wikipedia.org/wiki/NoSQL
Thank you for your attention!
lubos.takac@gmail.com

Big data, Hadoop, NoSQL DB - introduction

  • 1. Big Data, Hadoop, NoSQL DB - Introduction Ing. Ľuboš Takáč, PhD. University of Žilina November, 2013
  • 2. Overview • Big Data • Hadoop – HDFS – Map Reduce Paradigm • NoSQL Databases
  • 3. Big Data • the origin of the term “BIG DATA” is unclear • there are a lot of definitions, e.g. “Big data is now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.” Matt Aslett
  • 4. Big Data • can be defined by the (original) 3 Vs – Volume (a lot of data) – Variety (variously structured data) – Velocity (fast processing) – other Vs • Veracity (IBM) • Value (Oracle) • etc.
  • 5. Where are Big Data Generated
  • 6. Sample of Big Data Use Cases Today
  • 7. Hadoop • new idea to store and process distributed data • open source project based on Google's GFS (Google File System) and the Map Reduce paradigm – Google published papers about GFS and Map Reduce in 2003-2004 • the open source community led by Doug Cutting applied these tools to the open source search engine Nutch • in 2006 it became its own project named Hadoop
  • 8. Different Approach for Data Processing powerful hardware commodity hardware
  • 9. HDFS (Hadoop Distributed File System) • the core part of Hadoop • open source implementation of Google's GFS (Google File System) • designed for commodity hardware • responsible for distributing files throughout the cluster (the connected PCs in Hadoop) • designed for high throughput rather than low latency • typical files are gigabytes in size • files are broken down into blocks (64 MB, 128 MB) • blocks are replicated (typically 3 replicas) • rack aware, write once (append) • fault tolerant
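The block and replica figures above can be turned into a rough storage estimate. The following Python sketch is only an illustration: the function name `hdfs_footprint` and the 1 GB example file are assumptions, and it ignores metadata overhead. It relies on the fact that the last block of a file occupies only its actual size.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replicas=3):
    """Return (blocks, block copies, raw storage in MB) for one file.

    Hypothetical helper for illustration; not part of any Hadoop API.
    """
    # A file is cut into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replicas` times on different nodes.
    copies = blocks * replicas
    # Raw disk usage is the file size times the replication factor.
    raw_mb = file_size_mb * replicas
    return blocks, copies, raw_mb

# A 1 GB file with 128 MB blocks and 3 replicas.
print(hdfs_footprint(1024))
```

With 64 MB blocks the same file needs twice as many blocks, but the raw storage stays the same because it depends only on the replication factor.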
  • 10. HDFS – usage example • $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg – (this acts like a virtual folder; after copying, all PCs in the cluster can access those files) • $ bin/hadoop dfs -ls /user/hadoop – (the virtual folder is accessible via common commands)
  • 11. Map Reduce Paradigm • processing of data stored in HDFS • map task – works locally on a part of the overall data • reduce task – collects and processes the results of the map tasks
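The division of work between map and reduce tasks can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not Hadoop's API: the names `map_reduce`, `map_fn` and `reduce_fn` are hypothetical, and the odd/even example data is made up.

```python
from collections import defaultdict

def map_reduce(splits, map_fn, reduce_fn):
    """Run map_fn on every record of every split, then reduce per key."""
    intermediate = defaultdict(list)
    # Each split stands for the part of the data one map task sees locally;
    # in a real cluster the splits would be processed on different nodes.
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                intermediate[key].append(value)
    # The reduce step collects all values emitted for a key and combines them.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Example: sum odd and even numbers spread over two splits.
splits = [[3, 8, 1], [4, 10]]
result = map_reduce(
    splits,
    map_fn=lambda n: [("even" if n % 2 == 0 else "odd", n)],
    reduce_fn=lambda key, values: sum(values),
)
print(result)
```

The map function never looks outside its own record, which is what allows the map tasks to run independently on separate parts of the data.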
  • 12. Map Reduce Example “Hello World” • text files over HDFS • word count – counting the frequency of words
  • 13. Map Reduce Example (Code) Map phase Reduce Phase
  • 14. Map Reduce Example (How it works)
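The code and diagram from the two slides above are not preserved in this transcript. As a stand-in, here is a minimal Python sketch of word count that makes the three stages explicit: map, grouping by key (shuffle), and reduce. It is not the Java code from the slide; the function names and the two input lines are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted counts by word, as the framework does between phases."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(groups)

def reduce_phase(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)

lines = ["Hello World", "Hello Hadoop"]  # made-up input
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(mapped)
print(grouped)
print(word_counts)
```

Printing the intermediate values shows how the data changes shape: a flat list of pairs after the map phase, one list of counts per word after grouping, and one total per word after the reduce phase.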
  • 15. Map Reduce Task (Execution) • $ bin/hadoop jar WordCount.jar /user/hadoop/input_dir /user/hadoop/output_dir • $ bin/hadoop dfs -cat /user/hadoop/output_dir/part-r-00000
  • 16. Map Reduce Task – Monitoring & Debugging • Hadoop has an interactive web interface for monitoring tasks and the cluster • log files
  • 19. Hadoop Ecosystem • other tools usable with Hadoop (or made for Hadoop)
  • 20. Hadoop Ecosystem • Hadoop (HDFS, Map Reduce framework) • Avro (data serialization) • Chukwa (monitoring large clustered systems) • Flume (data collection and navigation) • HBase (real-time read and write database) • Hive (data summarization and querying) • Lucene (text search) • Pig (programming and query language) • Sqoop (data transfer between Hadoop and databases) • Oozie (workflow and job orchestration) • etc.
  • 21. Hadoop Distributions • open source (hard to configure), http://hadoop.apache.org/ • commercial solutions – debugged ready-made solutions with support – include proprietary software and hardware – user friendly interfaces, also in cloud – IBM • InfoSphere BigInsights • Cloudera – ORACLE • Exadata • Exalytics
  • 22. NoSQL Databases • SQL – traditional relational DBMS • not every data management/analysis problem is best solved exclusively using a traditional relational DBMS • NoSQL = No SQL = not using a traditional relational DBMS • NoSQL = not only SQL • NoSQL databases are not a substitute for SQL DBMSs and do not try to replace them • often used for Big Data
  • 23. NoSQL Databases • designed for fast retrieval and appending operations • no data structures • types – document store – graph databases – key-value store – etc. • key-value store (like a relational table with two columns, key and value)
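The key-value model described above can be sketched as a thin wrapper around a dictionary. This is an in-memory illustration only; the class name `KeyValueStore` and its methods are hypothetical and do not correspond to the API of any particular NoSQL product.

```python
class KeyValueStore:
    """A toy in-memory key-value store: one 'table' with two columns."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store (or overwrite) the value for a key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Retrieve the value for a key; lookup is by key only."""
        return self._data.get(key, default)

    def append(self, key, item):
        """Append an item to the list stored under a key, creating it if needed."""
        self._data.setdefault(key, []).append(item)

store = KeyValueStore()
store.put("user:1", {"name": "Anna"})   # made-up example data
store.append("log:sensor-7", 21.5)
store.append("log:sensor-7", 21.9)
print(store.get("user:1"))
print(store.get("log:sensor-7"))
```

The store neither knows nor checks what a value contains, which is why retrieval and appending are cheap and why the structure of the data has to be handled by the application.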
  • 24. NoSQL Databases • advantages – low latency, high throughput – highly parallelizable, massive scalability – simplicity of design, easy to set up – relaxed consistency => higher performance and availability • disadvantages – no declarative query language => more programming – relaxed consistency => fewer guarantees – absence of model => data model is inside the application (a big step back) • examples: MongoDB, Neo4j, Dynamo, HBase, Allegro, Cassandra, etc.
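The first and third disadvantages can be illustrated with a small Python sketch. In a relational DBMS the question "which users are older than 30?" is a single declarative query such as `SELECT name FROM users WHERE age > 30`. With a plain key-value store the application has to scan the values and know their layout itself. The records and the function name below are made up for illustration.

```python
# Made-up records in a key-value layout; the store does not enforce any schema.
users = {
    "user:1": {"name": "Anna", "age": 34},
    "user:2": {"name": "Boris", "age": 27},
    "user:3": {"name": "Clara"},            # no "age" field at all
}

def names_older_than(store, min_age):
    """Application-side 'query': scan every value and filter in code."""
    return sorted(
        record["name"]
        for record in store.values()
        # The application must decide how to treat records without an age.
        if record.get("age", 0) > min_age
    )

print(names_older_than(users, 30))
```

The handling of the record without an `age` field lives in application code, which is what the slide means by the data model being inside the application.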
  • 25. Summary • Big Data – typically unstructured, generated data (sensors, applications) with potential – often not used before – volume, variety, velocity => hard to process with traditional technologies • Hadoop – open source technology for storing and processing distributed data – processing Big Data on a commodity hardware cluster – HDFS, Map Reduce (and the other components of the Hadoop ecosystem) • NoSQL Databases – not using a traditional relational DBMS – typically key-value stores, easy to set up – designed for fast retrieval and appending operations – highly parallelizable
  • 27. Thank you for your attention! lubos.takac@gmail.com