Big data, Hadoop, NoSQL DB - introduction
 

The presentation is about a new idea for storing and processing distributed data.


    Presentation Transcript

    • Big Data, Hadoop, NoSQL DB - Introduction Ing. Ľuboš Takáč, PhD. University of Žilina November, 2013
    • Overview • Big Data • Hadoop – HDFS – Map Reduce Paradigm • NoSQL Databases
    • Big Data • the origin of the term “BIG DATA” is unclear • there are a lot of definitions, e.g. “Big data is now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.” Matt Aslett
    • Big Data • can be defined by the (original) 3 Vs – Volume (a lot of data) – Variety (varied structure) – Velocity (fast processing) • other Vs – Veracity (IBM) – Value (Oracle) – etc.
    • Where are Big Data Generated
    • Sample of Big Data Use Cases Today
    • Hadoop • a new idea for storing and processing distributed data • an open source project based on Google's GFS (Google File System) and the Map Reduce paradigm – Google published papers on GFS and Map Reduce in 2003–2004 • an open source community led by Doug Cutting applied these tools to the open source search engine Nutch • in 2006 it became its own project, named Hadoop
    • Different Approach for Data Processing: powerful hardware vs. commodity hardware
    • HDFS (Hadoop Distributed File System) • the core part of Hadoop • an open source implementation of Google's GFS (Google File System) • designed for commodity hardware • responsible for distributing files throughout the cluster (the connected PCs in Hadoop) • designed for high throughput rather than low latency • typical files are gigabytes in size • files are broken down into blocks (64 MB, 128 MB) • blocks are replicated (typically 3 replicas) • rack aware, write once (append only) • fault tolerant
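The block and replication mechanics described above can be illustrated with a toy model. This is a sketch only, with made-up node names and a simple round-robin placement; the real HDFS NameNode also takes rack topology into account:

```python
# Toy model of HDFS block splitting and replica placement (illustrative
# only; real HDFS does this inside the NameNode, with rack awareness).
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks
REPLICATION = 3                # typical replica count

def place_blocks(file_size, nodes):
    """Split a file of file_size bytes into blocks and assign each
    block to REPLICATION distinct nodes."""
    n_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # round up
    placement = []
    for b in range(n_blocks):
        # round-robin over the cluster, 3 distinct nodes per block
        replicas = [nodes[(b + i) % len(nodes)] for i in range(REPLICATION)]
        placement.append(replicas)
    return placement

cluster = ["node1", "node2", "node3", "node4"]
# a 200 MB file does not fit one 64 MB block: it is split into 4 blocks
plan = place_blocks(200 * 1024 * 1024, cluster)
```

A node failure then costs only the replicas it held; the remaining copies of each block keep the file readable, which is the fault tolerance the slide refers to.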
    • HDFS – usage example • $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg – (the destination behaves like a virtual folder; after copying, every PC in the cluster can access those files) • $ bin/hadoop dfs -ls /user/hadoop – (the virtual folder is accessible via familiar shell-style commands)
    • Map Reduce Paradigm • processing of data stored in HDFS • map task – works locally on a part of the overall data • reduce task – collects and processes the results of the map tasks
    • Map Reduce Example “Hello World” • text files over HDFS • word count – counting the frequency of words
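The word-count "Hello World" above can be simulated in plain Python. This is a sketch of the paradigm, not Hadoop's actual Java API: map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums each group:

```python
from collections import defaultdict

def map_phase(line):
    # map task: emit a (word, 1) pair for every word in its input split
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    # reduce task: sum the counts gathered for one word
    return (word, sum(counts))

lines = ["Hello World", "Hello Hadoop"]

# shuffle: group all intermediate (word, 1) pairs by key
groups = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        groups[word].append(one)

result = dict(reduce_phase(w, c) for w, c in groups.items())
# result == {"hello": 2, "world": 1, "hadoop": 1}
```

In Hadoop the map calls run in parallel on the nodes holding each HDFS block, and the framework performs the shuffle over the network before the reduce tasks start.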
    • Map Reduce Example (Code): map phase and reduce phase
    • Map Reduce Example (How it works)
    • Map Reduce Task (Execution) • $ bin/hadoop jar WordCount.jar /user/hadoop/input_dir /user/hadoop/output_dir • $ bin/hadoop dfs -cat /user/hadoop/gutenberg-output/part-r-00000
    • Map Reduce Task – Monitoring & Debugging • Hadoop has an interactive web interface for watching tasks and the cluster • log files
    • Hadoop Ecosystem • the other tools usable with Hadoop (or made for Hadoop)
    • Hadoop Ecosystem • Hadoop (HDFS, Map Reduce framework) • Avro (data serialization) • Chukwa (monitoring large clustered systems) • Flume (data collection and aggregation) • HBase (real-time read and write database) • Hive (data summarization and querying) • Lucene (text search) • Pig (programming and query language) • Sqoop (data transfer between Hadoop and databases) • Oozie (workflow and job orchestration) • etc.
    • Hadoop Distributions • open source (hard to configure), http://hadoop.apache.org/ • commercial solutions – debugged, ready-made solutions with support – include proprietary software and hardware – user-friendly interfaces, also in the cloud – IBM (InfoSphere BigInsights) – Cloudera – Oracle (Exadata, Exalytics)
    • NoSQL Databases • SQL – the traditional relational DBMS • not every data management/analysis problem is best solved exclusively with a traditional relational DBMS • NoSQL = no SQL = not using a traditional relational DBMS • NoSQL = not only SQL • NoSQL databases are not a substitute for SQL DBMSs and do not try to replace them • often used for Big Data
    • NoSQL Databases • designed for fast retrieval and append operations • no fixed data schema • types – document stores – graph databases – key-value stores – etc. • key-value store (like a relational table with two columns, key and value)
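A key-value store as described above (a two-column table of key and value) can be sketched as a thin wrapper over a dict. The class and key names here are hypothetical; a real store (e.g. Dynamo, HBase) adds persistence, partitioning, and replication on top of this interface:

```python
class KeyValueStore:
    """Minimal in-memory key-value store: put, get, delete only --
    no query language, no schema, no joins."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # insert or overwrite; this is the only write path
        self._data[key] = value

    def get(self, key, default=None):
        # retrieval is by exact key only
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Anna", "visits": 7})
profile = store.get("user:42")
```

The deliberately tiny interface is what makes these stores easy to partition across many machines: any node can serve any key independently, which is the source of the low latency and scalability listed on the next slide.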
    • NoSQL Databases • advantages – low latency, high throughput – highly parallelizable, massive scalability – simplicity of design, easy to set up – relaxed consistency => higher performance and availability • disadvantages – no declarative query language => more programming – relaxed consistency => fewer guarantees – absence of model => data model is inside the application (a big step back) • examples: MongoDB, Neo4j, Dynamo, HBase, Allegro, Cassandra, etc.
    • Summary • Big Data – typically unstructured, machine-generated data (sensors, applications) with untapped potential – often unused before – volume, variety, velocity => hard to process with traditional technologies • Hadoop – open source technology for storing and processing distributed data – processes Big Data on a commodity hardware cluster – HDFS, Map Reduce (and the other components of the Hadoop Ecosystem) • NoSQL Databases – do not use a traditional relational DBMS – typically key-value stores, simple by design – designed for fast retrieval and append operations – highly parallelizable
    • References • [1] J.P. Dijcks, Oracle: Big Data for the Enterprise, Jan. 2012. • [2] Ľ. Takáč, Data Processing over Very Large Databases, PhD thesis, 2013. • [3] O. Dolák, Big Data, http://www.systemonline.cz, 2012. • [4] P. Zikopoulos, D. deRoos, K. Parasuraman, T. Deutsch, D. Corrigan, J. Giles, Harness the Power of Big Data, ISBN 978-0-07-180817-0, 2013. • [5] http://www.go-globe.com, 2013. • [6] T. Kanik, M. Kováč, NoSQL – Non-Relational Database Systems as the New Generation of DBMS, OSSConf, 2012. • [7] http://wiki.apache.org/hadoop, 2013. • [8] http://hadoop.apache.org, 2013. • [9] L22: SC Report, Map Reduce, The University of Utah. • [10] http://bigdatauniversity.com, 2013. • [11] http://en.wikipedia.org/wiki/NoSQL
    • Thank you for your attention! lubos.takac@gmail.com