Presented By Ladislav Urban

www.syoncloud.com
Ladislav Urban, CEO of Syoncloud.
    Syoncloud is a consulting company specializing in
      Big Data analytics and integration with existing
                         systems.



WWW.SYONCLOUD.COM   E-MAIL : INFO@SYONCLOUD.COM   MOBILE : 077 9664 6474
CURRENT SOURCES OF DATA TO
         BE PROCESSED AND UTILIZED
         
             Documents
         
             Existing relational databases (CRM, ERP, Accounting, Billing)
         
             E-mails and attachments
         
             Imaging data (graphs, technical plans)
         
             Sensor or device data
         
             Internet search indexing
         
             Log files
         
             Social media
CURRENT SOURCES OF DATA TO
        BE PROCESSED AND UTILIZED
           
               Telephone conversations
           
               Videos
           
               Pictures
           
               Clickstreams (clicks from users on web pages)




SCALE OF THE DATA




WHEN DO WE NEED A NOSQL /
         BIG DATA SOLUTION?
    
        If relational databases do not scale to your traffic needs
    
        If the normalized schema of your relational database has
        become too complex
    
        If your business applications generate large amounts of
        supporting and temporary data
    
        If the database schema is already denormalized in order to
        improve response times
    
        If joins in relational databases slow the system to a crawl

WHEN DO WE NEED A NOSQL /
         BIG DATA SOLUTION?
         
             When we try to map complex hierarchical documents to
             database tables
         
             When documents from different sources require a flexible
             schema
         
             When more data beats clever algorithms
         
             When flexibility is required for analytics
         
             When we query for values at a specific point in time
         
             When we need to utilize outputs from many existing systems
WHEN DO WE NEED A NOSQL /
         BIG DATA SOLUTION?
         
             To analyze unstructured data such as documents and log
             files, or semi-structured data such as CSV files and
             forms




WHAT ARE THE STRONG POINTS
         OF RELATIONAL DATABASES?
         
             The SQL language. It is well known, standardized and based
             on a strong mathematical foundation.
         
             Database schemas that do not need to be modified in
             production.
         
             Workloads where massive scalability is not required.
         
             Mature security features: role-based security, encrypted
             communications, row- and field-level access control.
         
             Full support for ACID transactions (atomicity, consistency,
             isolation, durability).
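As a quick illustration of why ACID transactions matter, a classic transfer between two accounts must either commit as a whole or not at all (the table and column names here are hypothetical):

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- atomic: both updates are applied, or neither is
```

If the system fails between the two updates, the transaction is rolled back and no money is lost or created.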

WHAT ARE THE STRONG POINTS
        OF RELATIONAL DATABASES?
         
              Support for backup and rollback of data in case of
              data loss or corruption.
          
              Relational databases have mature development, tuning and
              monitoring tools with good GUIs.




Batch vs Real-time Processing
         
             Batch processing is used when real-time processing is
             not required, not possible or too expensive.
         
             Conversion of unstructured data such as text files and
             log files into more structured records
         
             Transformations during ETL
         
             Ad-hoc analysis of data
         
             Data analytics applications and reporting


BATCH PROCESSING INFRASTRUCTURE




BATCH PROCESSING INFRASTRUCTURE
         
             Batch processing systems utilize the Map/Reduce and
             HDFS implementations in Apache Hadoop.
         
             It is possible to develop a batch processing application
             in Java using only Hadoop, but we should mention
             other important systems and how they fit into the
             Hadoop infrastructure.
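The Map/Reduce model behind these systems can be sketched in a few lines of plain Python: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group. This is an illustrative toy, not Hadoop code.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["WARN disk full", "INFO started", "WARN disk full"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["warn"])  # 2
print(counts["info"])  # 1
```

In Hadoop, the map and reduce functions run on different machines and the shuffle is handled by the framework, but the data flow is the same.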




APACHE AVRO
          
              In order to process data we need information
              about data types and schemas.
          
              This information is used for serialization and
              deserialization in RPC communications, as well as for
              reading and writing files.




APACHE AVRO
         
             An RPC and serialization system that supports rich
             data structures
         
             It uses JSON to define data types and protocols
         
             It serializes data in a compact binary format
         
             Avro supports schema evolution
         
             Avro handles missing/extra/modified fields.
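As a sketch, an Avro schema for a log event could look like the following (the record and field names are hypothetical; the layout follows Avro's JSON schema format). The union type with a `null` branch and a default value is what makes fields optional, which is how schema evolution tolerates missing fields:

```json
{
  "type": "record",
  "name": "LogEvent",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "level", "type": "string"},
    {"name": "message", "type": ["null", "string"], "default": null}
  ]
}
```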



SCRIPT LANGUAGE FOR MAP/REDUCE
         
             We need a quick and simple way to create
             Map/Reduce transformations, analyses and
             applications.
         
             We need a scripting language that can be used in scripts
             as well as interactively on the command line.




APACHE PIG




APACHE PIG
         
             A high-level procedural language for querying large
             semi-structured data sets using Hadoop and the
             Map/Reduce platform
         
             Pig simplifies the use of Hadoop by allowing SQL-like
             queries to run on a distributed dataset.




APACHE PIG
       
           An example of filtering a log file for only WARN messages,
           which will run in parallel on a large cluster.
       
           The script below is automatically transformed into a Map/Reduce
           program and distributed across the Hadoop cluster.

           messages = LOAD '/var/log/messages';
           warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
           DUMP warns;



APACHE PIG
          Relational operators that can be used in Pig
      
          FILTER    - Select a set of tuples from a relation based on a condition.
      
          FOREACH - Iterate the tuples of a relation, generating a data
                 transformation.
      
          GROUP     - Group the data in one or more relations.
      
          JOIN      - Join two or more relations (inner or outer join).
      
          LOAD      - Load data from the file system.
      
          ORDER     - Sort a relation based on one or more fields.
      
          SPLIT     - Partition a relation into two or more relations.
      
          STORE     - Store data in the file system.
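Several of these operators can be combined in one short script. The sketch below counts log lines per level; the input path, delimiter and field names are hypothetical:

```pig
-- assume tab-delimited input lines: level <TAB> message
logs   = LOAD '/var/log/app.log' USING PigStorage('\t')
         AS (level:chararray, msg:chararray);
groups = GROUP logs BY level;                              -- one group per level
counts = FOREACH groups GENERATE group AS level,
                                 COUNT(logs) AS n;         -- aggregate each group
STORE counts INTO '/output/level_counts';
```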

What if we want to use SQL to create
             map/reduce jobs?
         
             Apache Hive is a data warehousing infrastructure
             based on Hadoop.
         
             It provides a query language called HiveQL, which is
             based on SQL.




APACHE HIVE
          
              Hive functions: data summarization, querying and
              analysis.
          
              It uses a system catalog called the Hive Metastore.
          
              Hive is not designed for OLTP or real-time queries.
          
              It is best used for batch jobs over large sets of
              append-only data.




APACHE HIVE




The HiveQL language supports the ability to
      
          Filter rows from a table using a where clause.
      
          Select certain columns from the table using a select clause.
      
          Do equi-joins between two tables.
      
          Evaluate aggregations on multiple "group by" columns for the
          data stored in a table.
      
          Store the results of a query into another table.
      
          Download the contents of a table to a local (NFS) directory.


The HiveQL language supports the ability to
       
            Store the results of a query in an HDFS directory.
       
           Manage tables and partitions (create, drop and alter).
       
           Plug in custom scripts in the language of choice for custom
           map/reduce jobs.
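Several of these abilities can be seen in one statement. The sketch below filters, aggregates with a GROUP BY, and stores the result in a new table; the `orders` table and its columns are hypothetical:

```sql
-- hypothetical table: orders(dt STRING, customer STRING, total DOUBLE)
CREATE TABLE daily_revenue AS
SELECT dt, customer, SUM(total) AS revenue
FROM orders
WHERE total > 0
GROUP BY dt, customer;
```

Hive compiles a query like this into one or more Map/Reduce jobs and runs them on the cluster.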




APACHE OOZIE
          
              Map/Reduce jobs, Pig scripts and Hive queries
              should be simple and single-purpose.
          
              How can we create complex ETL or data analysis in
              Hadoop?
          
              We chain scripts so that the output of one script is the
              input of another.
          
              Complex workflows that represent real-world
              scenarios need a workflow engine such as Apache
              Oozie.

APACHE OOZIE
        
            Oozie is a server-based workflow engine specialized in
            running workflow jobs with actions that run Hadoop
            Map/Reduce jobs, Pig jobs and others.
        
            An Oozie workflow is a collection of actions arranged in a
            DAG (Directed Acyclic Graph).
        
            This means that the second action cannot run until the first
            one has completed.
        
            Oozie workflow definitions are written in hPDL (an XML
            Process Definition Language similar to JBoss jBPM's jPDL).

APACHE OOZIE
         
             Workflow actions start jobs in the Hadoop cluster. Upon
             action completion, Hadoop calls back Oozie to
             notify it of the completion; at this point Oozie
             proceeds to the next action in the workflow.
         
             Oozie workflows contain control flow nodes (start,
             end, fail, decision, fork and join) and action nodes
             (the actual jobs).
         
             Workflows can be parameterized (using variables
             like ${inputDir} within the workflow definition).


Example of OOZIE workflow definition




<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
   <start to='wordcount'/>
   <action name='wordcount'>
      <map-reduce>
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <configuration>
            <property>
               <name>mapred.mapper.class</name>
               <value>org.myorg.WordCount.Map</value>
            </property>
            <property>
               <name>mapred.reducer.class</name>
               <value>org.myorg.WordCount.Reduce</value>
            </property>
            <property>
               <name>mapred.input.dir</name>
               <value>${inputDir}</value>
            </property>
            <property>
               <name>mapred.output.dir</name>
               <value>${outputDir}</value>
            </property>
         </configuration>
      </map-reduce>
      <ok to='end'/>
      <error to='kill'/>
   </action>
   <kill name='kill'>
      <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
   </kill>
   <end name='end'/>
</workflow-app>
APACHE Sqoop




APACHE Sqoop
        
            Apache Sqoop is a tool for transferring bulk data between
            Apache Hadoop and structured datastores such as
            relational databases or data warehouses.
        
            It can be used to populate tables in Hive and HBase.
        
            Sqoop integrates with Oozie, allowing you to schedule
            and automate import and export tasks.
        
            Sqoop uses a connector-based architecture that
            supports plugins providing connectivity to external
            systems.

APACHE Sqoop
        
            Sqoop includes connectors for databases such as
            MySQL, PostgreSQL, Oracle, SQL Server and DB2, plus a
            generic JDBC connector.
        
            The transferred dataset is sliced into partitions and a
            map-only job is launched, with individual mappers
            responsible for transferring a slice of the dataset.
        
            Sqoop uses the database metadata to infer data types.




Apache Sqoop – Import to HDFS




APACHE Sqoop
       An example of importing data from a MySQL database's ORDERS
       table into a Hive table running on Hadoop:

       sqoop import --connect jdbc:mysql://localhost/acmedb \
         --table ORDERS --username test --password **** \
         --hive-import

       Sqoop takes care of populating the Hive metastore with the
       appropriate metadata for the table and also invokes the necessary
       commands to load the table or partition.

Apache Sqoop – Export to Database
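The import example has an export counterpart that pushes HDFS data back into a relational table. A hedged sketch, with the target table and HDFS path hypothetical:

```
sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDER_SUMMARY --username test \
  --export-dir /user/hive/warehouse/order_summary
```

As with import, the export runs as a parallel map-only job, with each mapper writing a slice of the dataset to the database.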




APACHE FLUME
        ▪ A distributed system to reliably collect, aggregate and
          move large amounts of log data from many different
          sources to a centralized data store.




APACHE FLUME




APACHE FLUME
         
             A Flume Source consumes events delivered to it by an
             external source such as a web server.
         
             When a Flume Source receives an event, it stores it
             into one or more Channels.
         
             The Channel is a passive store that keeps the event
             until it is consumed by a Flume Sink.
         
             The Sink removes the event from the Channel and
             puts it into an external repository such as HDFS.
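The Source → Channel → Sink chain is wired together in an agent configuration file. A minimal sketch (the agent name, file paths and channel sizing are hypothetical) that tails a log file into HDFS:

```
# one agent with one source, one memory channel and one HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# source: run a command and turn its output lines into events
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory

# sink: write events into an HDFS directory
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```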


APACHE FLUME FEATURES
     
         It allows building multi-hop flows where events travel through
         multiple agents before reaching the final destination.
     
         It also allows fan-in and fan-out flows, contextual routing and
         backup routes (fail-over) for failed hops.
     
         Flume uses a transactional approach to guarantee reliable
         delivery of events.
     
         Events are staged in the channel, which manages recovery from
         failure.
     
         Flume supports source types such as Avro, Syslog and Netcat.

DISTCP - DISTRIBUTED COPY
       
           DistCp (distributed copy) is a tool used for large
           inter- and intra-cluster copying.
       
           It uses Map/Reduce for its distribution, error handling,
           recovery and reporting.
       
           It expands a list of files and directories into input for map
           tasks, each of which copies a partition of the files
           specified in the source list.
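A typical invocation copies a directory from one cluster's NameNode to another's; the hostnames, port and paths below are hypothetical:

```
hadoop distcp hdfs://nn1:8020/source/dir hdfs://nn2:8020/dest/dir
```

Each map task then copies its share of the files under `/source/dir`, so the copy scales with the size of the cluster.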



REAL-TIME PROCESSING – NOSQL
         DATABASES
                    ▪ Document stores:
                    Apache CouchDB, MongoDB
                    ▪ Graph stores:
                    Neo4j
                    ▪ Key-value stores:
                    Apache Cassandra, Riak
                    ▪ Tabular stores:
                    Apache HBase

CAP THEOREM




HBASE ARCHITECTURE




QUESTIONS & ANSWERS



LADISLAV URBAN
www.syoncloud.com
info@syoncloud.com
Mobile : 077 9664 6474

The Transformation of your Data in modern IT (Presented by DellEMC)The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)
 

Big data overview

  • 1. Presented By Ladislav Urban www.syoncloud.com
  • 2. Ladislav Urban, CEO of Syoncloud. Syoncloud is a consulting company specialized in Big Data analytics and integration of existing systems. WWW.SYONCLOUD.COM E-MAIL : INFO@SYONCLOUD.COM MOBILE : 077 9664 6474
  • 3. CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED
       - Documents
       - Existing relational databases (CRM, ERP, Accounting, Billing)
       - E-mails and attachments
       - Imaging data (graphs, technical plans)
       - Sensor or device data
       - Internet search indexing
       - Log files
       - Social media
  • 4. CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED
       - Telephone conversations
       - Videos
       - Pictures
       - Clickstreams (clicks from users on web pages)
  • 5. SCALE OF THE DATA
  • 6. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
       - If relational databases do not scale to your traffic needs
       - If the normalized schema of your relational database becomes too complex
       - If your business applications generate lots of supporting and temporary data
       - If the database schema is already denormalized in order to improve response times
       - If joins in relational databases slow the system down to a crawl
  • 7. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
       - If we try to map complex hierarchical documents to database tables
       - If documents from different sources require a flexible schema
       - When more data beats clever algorithms
       - When flexibility is required for analytics
       - When we need to query values at a specific time in history
       - When we need to utilize outputs from many existing systems
  • 8. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
       - To analyze unstructured data such as documents and log files, or semi-structured data such as CSV files and forms
  • 9. WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?
       - The SQL language: well known, standardized and based on strong mathematical theory
       - Database schemas that do not need to be modified during production
       - Workloads where scalability is not required
       - Mature security features: role-based security, encrypted communications, row- and field-level access control
       - Full support of ACID transactions (atomicity, consistency, isolation, durability)
  • 10. WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?
       - Support for backup and rollback in case of data loss or corruption
       - Mature development, tuning and monitoring tools with good GUIs
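The ACID guarantee mentioned above can be seen in miniature with SQLite, which ships with Python. This is only an illustrative sketch (the table and data are invented, not from the slides): a failure in the middle of a transfer rolls the whole transaction back, so no partial update is ever visible.

```python
import sqlite3

# In-memory database; a stand-in for any ACID-compliant relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, alice still has 100.
balance = conn.execute("SELECT balance FROM accounts "
                       "WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100
```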
  • 11. Batch vs Real-time Processing
       - Batch processing is used when real-time processing is not required, not possible or too expensive.
       - Conversion of unstructured data such as text files and log files into more structured records
       - Transformation during ETL
       - Ad-hoc analysis of data
       - Data analytics applications and reporting
  • 12. BATCH PROCESSING INFRASTRUCTURE
  • 13. BATCH PROCESSING INFRASTRUCTURE
       - Batch processing systems utilize the Map/Reduce and HDFS implementations in Apache Hadoop.
       - It is possible to develop batch processing applications in Java using only Hadoop, but we should mention other important systems and how they fit into the Hadoop infrastructure.
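The Map/Reduce model behind Hadoop can be sketched in a few lines of plain Python. This is a single-process simulation for intuition only, not Hadoop code: mappers emit (key, value) pairs, a shuffle step groups pairs by key, and reducers aggregate each group.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each mapper emits (word, 1) pairs for its input split.
def mapper(line):
    return [(word.lower(), 1) for word in line.split()]

# Shuffle: group intermediate pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word.
def reducer(key, values):
    return key, sum(values)

splits = ["big data big cluster", "data lake"]
pairs = chain.from_iterable(mapper(line) for line in splits)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'lake': 1}
```

In a real cluster the splits live on HDFS and the map and reduce tasks run on many machines, but the data flow is the same.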
  • 14. APACHE AVRO
       - In order to process data we need information about data types and data schemas.
       - This information is used for serialization and deserialization in RPC communications, as well as for reading and writing files.
  • 15. APACHE AVRO
       - An RPC and serialization system that supports rich data structures
       - It uses JSON to define data types and protocols
       - It serializes data in a compact binary format
       - Avro supports schema evolution
       - Avro will handle missing/extra/modified fields
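Because Avro schemas are plain JSON, they can be inspected with nothing but the standard library. The record below is an invented example (the field names are ours, not from the slides); the union type `["null", "string"]` with a default is the mechanism that lets readers with a newer schema handle records written before the field existed.

```python
import json

# An illustrative Avro record schema describing a log event.
schema_json = """
{
  "type": "record",
  "name": "LogEvent",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "level",     "type": "string"},
    {"name": "message",   "type": "string"},
    {"name": "host",      "type": ["null", "string"], "default": null}
  ]
}
"""
schema = json.loads(schema_json)

# The schema is ordinary JSON, so tooling can introspect it easily.
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['timestamp', 'level', 'message', 'host']
```

Actual serialization to Avro's compact binary format is done by an Avro library, which consumes exactly this kind of schema document.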
  • 16. SCRIPT LANGUAGE FOR MAP/REDUCE
       - We need a quick and simple way to create Map/Reduce transformations, analyses and applications.
       - We need a scripting language that can be used in scripts as well as interactively on the command line.
  • 17. APACHE PIG
  • 18. APACHE PIG
       - A high-level procedural language for querying large semi-structured data sets using Hadoop and the Map/Reduce platform
       - Pig simplifies the use of Hadoop by allowing SQL-like queries to run on distributed datasets.
  • 19. APACHE PIG
       - An example of filtering a log file for only warning messages; it will run in parallel on a large cluster.
       - The given script is automatically transformed into a Map/Reduce program and distributed across the Hadoop cluster.
       messages = LOAD '/var/log/messages';
       warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
       DUMP warns;
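For intuition, the same FILTER that the Pig script expresses can be written in plain Python over an in-memory list (the sample log lines are invented). Pig's value is that the identical logic runs in parallel over terabytes on a cluster.

```python
import re

# Keep only lines matching the Pig script's pattern '.*WARN+.*'.
pattern = re.compile(r".*WARN+.*")

messages = [
    "2024-01-01 INFO  service started",
    "2024-01-01 WARN  disk usage at 91%",
    "2024-01-02 ERROR write failed",
    "2024-01-02 WARN  retrying connection",
]
warns = [line for line in messages if pattern.match(line)]
print(len(warns))  # 2
```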
  • 20. APACHE PIG
       Relational operators that can be used in Pig:
       - FILTER - Select a set of tuples from a relation based on a condition.
       - FOREACH - Iterate over the tuples of a relation, generating a data transformation.
       - GROUP - Group the data in one or more relations.
       - JOIN - Join two or more relations (inner or outer join).
       - LOAD - Load data from the file system.
       - ORDER - Sort a relation based on one or more fields.
       - SPLIT - Partition a relation into two or more relations.
       - STORE - Store data in the file system.
  • 21. What if we want to use SQL to create map/reduce jobs?
       - Apache Hive is a data warehousing infrastructure based on Hadoop.
       - It provides a query language called HiveQL, which is based on SQL.
  • 22. APACHE HIVE
       - Hive functions: data summarization, query and analysis.
       - It uses a system catalog called the Hive Metastore.
       - Hive is not designed for OLTP or real-time queries.
       - It is best used for batch jobs over large sets of append-only data.
  • 23. APACHE HIVE
  • 24. HiveQL language supports the ability to
       - Filter rows from a table using a where clause.
       - Select certain columns from the table using a select clause.
       - Do equi-joins between two tables.
       - Evaluate aggregations on multiple "group by" columns for the data stored in a table.
       - Store the results of a query into another table.
       - Download the contents of a table to a local (NFS) directory.
  • 25. HiveQL language supports the ability to
       - Store the results of a query in an HDFS directory.
       - Manage tables and partitions (create, drop and alter).
       - Plug in custom scripts in the language of choice for custom map/reduce jobs.
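As a sketch of what Hive compiles to, consider the aggregation `SELECT region, SUM(amount) FROM orders GROUP BY region` (table and column names are illustrative). In Python the same group-by-and-sum looks like this; Hive turns such queries into Map/Reduce jobs where the group key becomes the shuffle key.

```python
from collections import defaultdict

# Rows as (region, amount) tuples; a toy stand-in for a Hive table.
orders = [("east", 10), ("west", 5), ("east", 7), ("north", 3)]

# GROUP BY region, SUM(amount): accumulate per-key totals.
totals = defaultdict(int)
for region, amount in orders:
    totals[region] += amount

print(dict(totals))  # {'east': 17, 'west': 5, 'north': 3}
```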
  • 26. APACHE OOZIE
       - Map/Reduce jobs, Pig scripts and Hive queries should be simple and single-purpose.
       - How can we create complex ETL or data analysis in Hadoop?
       - We chain scripts so the output of one script is the input of another.
       - Complex workflows that represent real-world scenarios need a workflow engine such as Apache Oozie.
  • 27. APACHE OOZIE
       - Oozie is a server-based workflow engine specialized in running workflow jobs whose actions run Hadoop Map/Reduce, Pig and other jobs.
       - An Oozie workflow is a collection of actions arranged in a DAG (Directed Acyclic Graph).
       - This means that the second action cannot run until the first one has completed.
       - Oozie workflow definitions are written in hPDL (an XML process definition language similar to JBoss jBPM jPDL).
  • 28. APACHE OOZIE
       - Workflow actions start jobs in the Hadoop cluster. Upon completion, Hadoop calls back to Oozie to report that the action has finished, at which point Oozie proceeds to the next action in the workflow.
       - Oozie workflows contain control-flow nodes (start, end, fail, decision, fork and join) and action nodes (actual jobs).
       - Workflows can be parameterized (using variables like ${inputDir} within the workflow definition).
  • 29. Example of OOZIE workflow definition
  • 30.-31. Example of OOZIE workflow definition
       <workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
         <start to='wordcount'/>
         <action name='wordcount'>
           <map-reduce>
             <job-tracker>${jobTracker}</job-tracker>
             <name-node>${nameNode}</name-node>
             <configuration>
               <property>
                 <name>mapred.mapper.class</name>
                 <value>org.myorg.WordCount.Map</value>
               </property>
               <property>
                 <name>mapred.reducer.class</name>
                 <value>org.myorg.WordCount.Reduce</value>
               </property>
               <property>
                 <name>mapred.input.dir</name>
                 <value>${inputDir}</value>
               </property>
               <property>
                 <name>mapred.output.dir</name>
                 <value>${outputDir}</value>
               </property>
             </configuration>
           </map-reduce>
           <ok to='end'/>
           <error to='kill'/>
         </action>
         <kill name='kill'>
           <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
         </kill>
         <end name='end'/>
       </workflow-app>
  • 32. APACHE Sqoop
  • 33. APACHE Sqoop
       - Apache Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases or data warehouses.
       - It can be used to populate tables in Hive and HBase.
       - Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
       - Sqoop uses a connector-based architecture which supports plugins that provide connectivity to external systems.
  • 34. APACHE Sqoop
       - Sqoop includes connectors for databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2, plus a generic JDBC connector.
       - The transferred dataset is sliced into partitions and a map-only job is launched, with individual mappers responsible for transferring a slice of the dataset.
       - Sqoop uses the database metadata to infer data types.
  • 35. Apache Sqoop – Import to HDFS
  • 36. APACHE Sqoop
       An example of importing data from the ORDERS table of a MySQL database into a Hive table running on Hadoop:
       sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --hive-import
       Sqoop takes care of populating the Hive metastore with appropriate metadata for the table and also invokes the necessary commands to load the table or partition.
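The slicing mentioned earlier can be sketched as splitting a primary-key range into contiguous slices, one per mapper. This is a simplification under our own assumptions (the real tool queries the database for actual boundary values rather than assuming a dense integer range):

```python
def split_range(lo, hi, num_mappers):
    """Split the inclusive id range [lo, hi] into contiguous slices,
    one slice per mapper, sizes differing by at most one."""
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    slices, start = [], lo
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder
        slices.append((start, start + size - 1))
        start += size
    return slices

# Four mappers each import roughly a quarter of ids 1..10.
print(split_range(1, 10, 4))  # [(1, 3), (4, 6), (7, 8), (9, 10)]
```

Each mapper then issues a bounded query (e.g. `WHERE id BETWEEN 1 AND 3`) against its own slice, so the import proceeds in parallel.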
  • 37. Apache Sqoop – Export to Database
  • 38. APACHE FLUME
       - Flume is a distributed system to reliably collect, aggregate and move large amounts of log data from many different sources to a centralized data store.
  • 39. APACHE FLUME
  • 40. APACHE FLUME
       - A Flume source consumes events delivered to it by an external source like a web server.
       - When a Flume source receives an event, it stores it into one or more channels.
       - The channel is a passive store that keeps the event until it is consumed by a Flume sink.
       - The sink removes the event from the channel and puts it into an external repository like HDFS.
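The source → channel → sink pipeline can be modeled in a few lines of Python. This is a toy single-process sketch, not Flume itself: a queue plays the channel, a list stands in for HDFS, and the sample events are invented.

```python
from queue import Queue

channel = Queue()  # the passive store between source and sink
hdfs = []          # stand-in for the external repository

def source(events):
    # The source stores each received event into the channel.
    for event in events:
        channel.put(event)

def sink():
    # The sink removes events from the channel and writes them out.
    while not channel.empty():
        hdfs.append(channel.get())

source(["login from 10.0.0.1", "logout from 10.0.0.1"])
sink()
print(hdfs)  # ['login from 10.0.0.1', 'logout from 10.0.0.1']
```

Real Flume adds what the toy omits: durable or transactional channels, multiple agents chained into multi-hop flows, and fan-in/fan-out routing.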
  • 41. APACHE FLUME FEATURES
       - It allows you to build multi-hop flows where events travel through multiple agents before reaching the final destination.
       - It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
       - Flume uses a transactional approach to guarantee reliable delivery of events.
       - Events are staged in the channel, which manages recovery from failure.
       - Flume supports log stream types such as Avro, Syslog and Netcat.
  • 42. DISTCP - DISTRIBUTED COPY
       - DistCp (distributed copy) is a tool used for large inter-/intra-cluster copying.
       - It uses Map/Reduce for its distribution, error handling, recovery and reporting.
       - It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
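The expansion step above amounts to dividing the source file list among map tasks. The round-robin split below is our own simplification (the file paths are invented, and the real tool balances partitions by file size, not by count):

```python
def partition_files(files, num_maps):
    """Assign each source path to one of num_maps map tasks."""
    partitions = [[] for _ in range(num_maps)]
    for i, path in enumerate(files):
        partitions[i % num_maps].append(path)  # round-robin assignment
    return partitions

files = ["/data/a.log", "/data/b.log", "/data/c.log",
         "/data/d.log", "/data/e.log"]

# Two map tasks each copy their own partition of the source list.
print(partition_files(files, 2))
```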
  • 43. REAL-TIME PROCESSING – NOSQL DATABASES
       - Document stores: Apache CouchDB, MongoDB
       - Graph stores: Neo4j
       - Key-value stores: Apache Cassandra, Riak
       - Tabular stores: Apache HBase
  • 44. CAP THEOREM
  • 45. HBASE ARCHITECTURE
  • 46. QUESTIONS & ANSWERS www.syoncloud.com LADISLAV URBAN info@syoncloud.com Mobile : 077 9664 6474