Big Data
Presented by,
Mohamedsalman S
(BIT CSE)
Contents
• Introduction.
• Components.
• Methods.
• What is Hadoop.
• Hadoop Offers.
• MapReduce.
• What is HPCC.
• HPCC Components.
• Big Data Samples.
• Difference between HPCC and Hadoop.
• Privacy and Security Issues.
• Knowledge Discovery.
• Conclusion.
Introduction
• Big data and its analysis are at the center of modern science and
business.
• The data are generated from online transactions, emails, videos,
audio, images, etc.
• Stored in databases, the data grow massively and become difficult to
capture, store, manage, and share.
• Data volume is predicted to double every two years, reaching about
8 zettabytes by 2015.
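The doubling rule above can be turned into simple arithmetic. The sketch below is only an illustration: it anchors on the slide's figure of 8 zettabytes in 2015 and assumes perfectly steady doubling every two years. The function name and parameters are hypothetical, and the printed values are what the rule implies, not measurements.

```python
def projected_zettabytes(year, base_year=2015, base_zb=8.0, doubling_years=2.0):
    """Volume implied by steady doubling, anchored at the slide's 2015 figure."""
    return base_zb * 2 ** ((year - base_year) / doubling_years)

# Each two-year step doubles (or, going backward, halves) the volume.
for year in (2011, 2013, 2015, 2017):
    print(year, projected_zettabytes(year))
```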
Components
• Variety.
Variety makes big data really big.
Big data comes from a great variety of sources.
It generally falls into three types: structured, unstructured, and
semi-structured.
Structured data enters a data warehouse already tagged and is
easily sorted.
Unstructured data is random and difficult to analyze.
Components
Semi-structured data does not conform to fixed fields but contains
tags to separate data elements.
• Volume.
The volume, or size, of data is now measured in terabytes, petabytes,
and even zettabytes.
• Velocity.
The flow of data is massive and continuous.
Big data should be used as it streams into the organization in order to
maximize its value.
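The three types of data described under Variety can be illustrated with a small Python sketch. The sample records and variable names are made up for illustration; the point is only how much structure each type gives a program to work with.

```python
import csv
import io
import json

# Structured: fixed fields in a fixed order, as in a warehouse table.
structured = "id,name,amount\n1,Asha,250\n2,Ravi,125\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: no fixed fields, but tags (keys) separate the elements.
semi_structured = '{"id": 3, "name": "Meena", "tags": ["email", "video"]}'
record = json.loads(semi_structured)

# Unstructured: free text with no fields or tags; it needs extra analysis.
unstructured = "Customer called about a late delivery and asked for a refund."
words = unstructured.lower().rstrip(".").split()

print(rows[0]["name"])   # field lookup by column name
print(record["tags"])    # element lookup by tag
print(len(words))        # only raw tokens are available
```

The structured and semi-structured records can be queried by name right away, while the free text has to be tokenized before anything can be asked of it.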
Methods
• Organizations face large amounts of new data that arrive in many
different forms.
• Big data has generated a whole new industry of supporting
architectures such as MapReduce.
• MapReduce is a programming framework for distributed computing.
• It was created by Google and uses a divide-and-conquer method.
• MapReduce can be divided into two stages:
Map step.
Reduce step.
• Two platforms are discussed here: HPCC and Hadoop.
What is Hadoop?
• Hadoop is an open-source software framework.
• It is a Java-based framework.
• Essentially, it accomplishes two tasks: massive data storage and
faster processing.
• It is not a replacement for a data warehouse or ETL.
Hadoop Offers
• HDFS - responsible for storing data on the cluster.
• MapReduce - programming framework for distributed computing.
• HBase - distributed database for random read/write access.
• Pig - high-level data processing system.
• Hive - data warehouse application.
• Sqoop - transfers data between relational databases and Hadoop.
MapReduce
• MapReduce is a programming framework for distributed computing.
• It was created by Google and uses a divide-and-conquer method.
• MapReduce can be divided into two stages:
Map step.
Reduce step.
Map Reduce
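The map and reduce stages can be sketched in a single Python process using the classic word-count example. This is a minimal illustration of the idea, not Hadoop's API: the function names `map_step`, `shuffle`, `reduce_step`, and `word_count` are hypothetical, and a real framework would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict

def map_step(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework does between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Divide: each document is mapped independently of the others.
    pairs = [pair for doc in documents for pair in map_step(doc)]
    # Conquer: each key is reduced independently of the others.
    return dict(reduce_step(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each document is mapped on its own and each word is reduced on its own, both stages can be spread over a cluster without changing the logic.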
What is HPCC?
• HPCC (High-Performance Computing Cluster) is also known as DAS
(Data Analytics Supercomputer).
• HPCC Systems is a distributed, data-intensive, open-source computing
platform that provides big data workflow management services.
• Unlike Hadoop, HPCC's data model is defined by the user.
• The HPCC platform does not require third-party tools such as
Greenplum, Cassandra, an RDBMS, or Oozie.
HPCC Components
• HPCC Data Refinery (Thor)
A massively parallel ETL engine that enables data integration
and provides batch-oriented data manipulation.
• HPCC Data Delivery Engine (Roxie)
High-throughput, ultra-fast, low-latency data delivery.
• Enterprise Control Language (ECL)
An easy-to-use programming language optimized for big data
operations and query transactions.
Big Data Samples
• Biological science.
• Life sciences.
• Medical records.
• Scientific research.
• Mobile phones.
• Government.
Difference between HPCC and Hadoop
Privacy and Security Issues
• Big data stores must be properly controlled.
• To ensure authentication, a cryptographically secure communication
framework has to be implemented.
• Data must be controlled as specified by regulations, such as those
imposing retention periods.
• Organizations have to consider the legal ramifications of storing
data.
Knowledge Discovery
• A set of operations designed to extract information from complicated
data sets.
• Preprocessing: removing noise, handling missing data fields, and
accounting for time information.
• Mapping goals to particular data mining methods.
• Choosing the data mining algorithm and the method for searching for
data patterns.
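The preprocessing step above (noise removal, missing fields, time information) can be sketched in Python. The sensor readings, the plausible range, and the median fill are all assumptions chosen for illustration; real pipelines pick these per data set.

```python
from datetime import datetime
from statistics import median

raw = [
    {"time": "2014-01-01 10:00", "temp": 21.5},
    {"time": "2014-01-01 11:00", "temp": None},   # missing data field
    {"time": "2014-01-01 12:00", "temp": 999.0},  # noise (sensor glitch)
    {"time": "2014-01-01 13:00", "temp": 22.5},
]

def clean(records, low=-50.0, high=60.0):
    # Noise removal: drop readings outside the plausible range.
    kept = [r for r in records
            if r["temp"] is None or low <= r["temp"] <= high]
    # Missing data: fill gaps with the median of the known values.
    fill = median(r["temp"] for r in kept if r["temp"] is not None)
    cleaned = []
    for r in kept:
        cleaned.append({
            # Time information: parse the text timestamp, keep the hour.
            "hour": datetime.strptime(r["time"], "%Y-%m-%d %H:%M").hour,
            "temp": fill if r["temp"] is None else r["temp"],
        })
    return cleaned

cleaned = clean(raw)
print(cleaned)
```

Only after this kind of cleanup does it make sense to choose a data mining method and search the records for patterns.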
Conclusion
• Big data is difficult to manage.
• The data must be kept secure.
• It is used by a growing number of organizations.
