BIG DATA
ANALYTICS
CONTENTS
1. Big Data
2. Data vs Big Data
3. Examples
4. Challenges
5. Big Data Analytics
6. Traditional vs Big Data analytics
7. Hadoop
8. Application
WHAT IS BIG DATA
Big data is a collection of data sets that are
large and complex in nature.
It comprises both structured and unstructured
data that grow so large and so fast that they
cannot be managed by traditional relational
database systems or conventional statistical tools.
DATA VS BIG DATA
Big data is just data with:
• More volume
• Faster data generation (velocity)
• Multiple data formats (variety)
The world's data volume is projected to grow 40%
per year and 50 times by 2020 [1].
Data comes from various human
and machine activities.
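As a rough check on how the two growth figures relate, the compound-growth arithmetic can be sketched in Python. This assumes the 40% rate compounds once per year, which the cited source does not spell out; the function names are illustrative only.

```python
import math


def growth_factor(annual_rate: float, years: float) -> float:
    """Total multiple of the starting volume after compounding for `years`."""
    return (1.0 + annual_rate) ** years


def years_to_reach(annual_rate: float, target_factor: float) -> float:
    """Years of compounding needed to multiply the volume by `target_factor`."""
    return math.log(target_factor) / math.log(1.0 + annual_rate)


# Assumption: 40% growth compounds annually.
ten_year_multiple = growth_factor(0.40, 10)
years_for_50x = years_to_reach(0.40, 50)

print(f"Multiple after 10 years: {ten_year_multiple:.1f}")
print(f"Years needed for a 50x increase: {years_for_50x:.1f}")
```

Under that assumption, a 50-fold increase corresponds to between 11 and 12 years of sustained 40% growth.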
BIG DATA ANALYTICS
IN PRACTICE
1. The New York Stock Exchange generates about
one terabyte of new trade data per day.
2. A single jet engine can generate more than 10 terabytes of
data in 30 minutes of flight time. With many
thousands of flights per day, the data generated
reaches many petabytes.
3. Statistics show that more than 500 terabytes of new data
are ingested into the databases of the social media
site Facebook every day. This data is mainly
generated through photo and video uploads,
message exchanges, comments, etc.
CHALLENGES
More data = more storage space
• More storage = more money to spend (RDBMS servers need
very costly storage)
Data arrives faster
• Speed up data processing or we’ll have a backlog
Need to handle various data structures
• How do we put the JSON data format into a standard RDBMS?
• Hey, we also have XML format from other sources
• Other systems give us compressed data in gzip format
Agile business requirements
• In the initial discussion they needed only 10 data items; now they
ask for 25. Can we do that? We put only those 10 in our
database
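The variety challenge above (JSON, XML, and gzip-compressed input) can be illustrated with a minimal sketch that normalizes all three into plain dictionaries using only the Python standard library. The sample records and function names are hypothetical, and real feeds would need schema checks and error handling.

```python
import gzip
import json
import xml.etree.ElementTree as ET


def parse_json(text: str) -> dict:
    """Decode a JSON object into a dict."""
    return json.loads(text)


def parse_xml(text: str) -> dict:
    """Flatten the direct children of the root element into a dict."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


def parse_gzip_json(blob: bytes) -> dict:
    """Decompress gzip bytes, then decode the JSON payload inside."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical sample records, one per source format.
json_record = '{"id": 1, "name": "sensor-a"}'
xml_record = "<record><id>2</id><name>sensor-b</name></record>"
gzip_record = gzip.compress(b'{"id": 3, "name": "sensor-c"}')

records = [
    parse_json(json_record),
    parse_xml(xml_record),
    parse_gzip_json(gzip_record),
]
for record in records:
    print(record)
```

Note that the XML parser returns every value as a string, so the XML record's id is "2" rather than 2. Differences like this are part of what makes loading mixed formats into a fixed relational schema awkward.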
TYPES OF BIG DATA
• Structured Data: Any data that can be stored,
accessed, and processed in a fixed format is
termed 'structured' data.
• Unstructured Data: Any data with an unknown form or
structure is classified as unstructured data.
• Semi-structured Data: Semi-structured data can
contain both forms of data, for example JSON or XML documents.
BENEFITS OF BIG
DATA PROCESSING
• Businesses can utilize outside intelligence when
making decisions: Access to social data from search
engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.
• Improved customer service: Traditional customer
feedback systems are being replaced by new
systems designed with 'Big Data' technologies. In
these new systems, Big Data and natural language
processing technologies are used to read and
evaluate consumer responses.
• Early identification of risk to products or services, if
any
• Better operational efficiency: 'Big Data' technologies
can be used to create a staging area or landing zone
for new data before identifying what data should be
moved to the data warehouse. In addition, such
integration of 'Big Data' technologies and the data
warehouse helps organizations offload infrequently
accessed data.
BIG DATA ANALYTICS
Big data analytics is the process of examining large
and varied data sets -- i.e., big data -- to uncover
hidden patterns, unknown correlations, market trends,
customer preferences and other useful information
that can help organizations make more-informed
business decisions.
TRADITIONAL VS
BIG DATA ANALYTICS
Traditional analytics:
• Uses known data that is well understood.
• Built on the relational database model.
Big data analytics:
• The data format is not well understood, because the data is
largely unstructured and semi-structured.
• Big data comes in various forms and formats from multiple
disconnected systems. The data sets are almost flat, with no
relationships between them.
4 TYPES OF
ANALYTICS
1. Descriptive: what happened?
2. Diagnostic: why did it happen?
3. Predictive: what is likely to happen?
4. Prescriptive: what should I do about it?
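The four questions above can be made concrete with a toy sketch on a tiny data set. The numbers (monthly ad spend and units sold) are invented for illustration. The diagnostic step uses a simple correlation as a stand-in and does not establish cause. The predictive and prescriptive steps assume a straight-line relationship that real data may not follow.

```python
from statistics import mean

# Hypothetical data: monthly ad spend and units sold.
ad_spend = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0]
units_sold = [100.0, 115.0, 140.0, 108.0, 170.0, 185.0]


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# 1. Descriptive: what happened?
average_sales = mean(units_sold)

# 2. Diagnostic: why did it happen? (correlation only hints at a driver)
r = correlation(ad_spend, units_sold)

# 3. Predictive: what is likely to happen at a spend of 22?
slope, intercept = fit_line(ad_spend, units_sold)
forecast = slope * 22.0 + intercept

# 4. Prescriptive: what spend would the fitted line need for 200 units?
target = 200.0
required_spend = (target - intercept) / slope

print(f"average sales: {average_sales:.1f}")
print(f"correlation: {r:.3f}")
print(f"forecast at spend 22: {forecast:.1f}")
print(f"spend needed for {target:.0f} units: {required_spend:.1f}")
```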
APPROACH TO ANALYTICS
1. Identify the data sources.
2. Select the right tools and technology to collect,
store, and aggregate the data.
3. Understand the business domain.
4. Identify tools and technology to process the data.
5. Build mathematical models for the analytics.
6. Visualize.
7. Validate your results.
8. Learn, adapt, and rebuild your analytical model.
ANALYTICS TOOLS
The most widely used statistical programming tools are:
• IBM SPSS
• SAS
• R
• MATLAB
R and MATLAB have the most comprehensive
support of statistical functions.
HADOOP
Hadoop is a framework that allows for distributed
processing of large data sets across clusters of
commodity computers using a simple programming model.
• Software framework that supports distributed
applications, licensed under the Apache v2 license.
• Hadoop was derived from Google's MapReduce and
Google File System papers.
• Yahoo has been one of the largest contributors to the project.
• Written in the Java programming language.
HADOOP :
MAPREDUCE
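The MapReduce programming model can be sketched with a single-process word count in Python. This is a minimal illustration of the map, shuffle, and reduce phases, not the Hadoop API: on a real cluster the map and reduce tasks run in parallel on different nodes and the framework performs the shuffle. All function names here are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data is big", "data grows fast"]
counts = word_count(docs)
print(counts)
```

Because each call to map_phase depends only on its own document and each call to reduce_phase only on its own key, both phases can be spread across many machines, which is the property Hadoop exploits.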
WHY USE HADOOP?
• Need to compress data
• Nodes fail every day
• Common infrastructure
  • Efficient
  • Easy to use
  • Open source
COMMON USES
• Searches
• Log processing
• Recommendation systems
• Analytics (Facebook, LinkedIn)
• Image and video processing (NASA)
• Data retention
TECHNOLOGIES AND
TOOLS
Unstructured and semi-structured data types typically
don't fit well in traditional data warehouses that are
based on relational databases oriented to structured
data sets.
As a result, many organizations that collect, process
and analyze big data turn to NoSQL databases as well
as Hadoop and its companion tools, including:
• MapReduce: a software framework that allows
developers to write programs that process massive
amounts of unstructured data in parallel across a
distributed cluster of processors or stand-alone
computers.
• YARN: a cluster management technology and one
of the key features in second-generation Hadoop.
• Spark: an open-source parallel processing
framework that enables users to run large-scale
data analytics applications across clustered
systems.
• HBase: a column-oriented key/value data store
built to run on top of the Hadoop Distributed File
System (HDFS).
• Hive: an open-source data warehouse system for
querying and analyzing large datasets stored in
Hadoop files.
• Kafka: a distributed publish-subscribe messaging
system designed to replace traditional message
brokers.
• Pig: an open-source technology that offers a
high-level mechanism for the parallel
programming of MapReduce jobs to be executed
on Hadoop clusters.
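To give a feel for what "column-oriented key/value data store" means in the list above, here is a toy in-memory sketch in Python. It is not the HBase API: the class and method names are hypothetical, and it leaves out persistence, versioning, and distribution. It only shows the idea of addressing a value by a row key plus a "family:qualifier" column name, with rows free to hold different columns.

```python
from collections import defaultdict
from typing import Dict, Optional


class ToyColumnStore:
    """In-memory toy: row key -> "family:qualifier" column -> value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = defaultdict(dict)

    def put(self, row_key: str, column: str, value: str) -> None:
        """Store one cell; rows need not share the same set of columns."""
        self._rows[row_key][column] = value

    def get(self, row_key: str, column: str) -> Optional[str]:
        """Fetch one cell, or None if the row or column is absent."""
        return self._rows.get(row_key, {}).get(column)

    def scan_column(self, column: str) -> Dict[str, str]:
        """Return the value of one column for every row that has it."""
        return {
            row_key: columns[column]
            for row_key, columns in self._rows.items()
            if column in columns
        }


store = ToyColumnStore()
store.put("user1", "profile:name", "Ada")
store.put("user1", "activity:last_login", "2020-01-01")
store.put("user2", "profile:name", "Lin")  # user2 has no activity column

print(store.get("user1", "profile:name"))
print(store.scan_column("profile:name"))
print(store.get("user2", "activity:last_login"))
```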
BIG DATA ANALYTICS
BENEFITS
• Driven by specialized analytics systems and
software, big data analytics can point the way to
various business benefits, including new revenue
opportunities, more effective marketing, better
customer service, improved operational efficiency
and competitive advantages over rivals.
• Big data analytics applications enable data
scientists, predictive modelers, statisticians and
other analytics professionals to analyze growing
volumes of structured transaction data, plus
other forms of data that are often left untapped by
conventional business intelligence (BI) and
analytics programs.
• On a broad scale, data analytics technologies and
techniques provide a means of analyzing data
sets and drawing conclusions about them to help
organizations make informed business decisions.
BIG DATA ANALYTICS
APPLICATION
• Government: The use and adoption of big data
within governmental processes allows efficiencies
in terms of cost, productivity, and innovation, but
does not come without its flaws.
• Manufacturing: According to the TCS 2013 Global Trend
Study, improvements in supply planning and
product quality provide the greatest benefit of big
data for manufacturing.
• Information Technology: Especially since 2015, big
data has come to prominence within business
operations as a tool to help employees work more
efficiently and streamline the collection and
distribution of Information Technology (IT).
• Education: A McKinsey Global Institute study found a
shortage of 1.5 million highly trained data
professionals and managers. A number of
universities, including the University of Tennessee and UC
Berkeley, have created master's programs to meet this
demand.
THANK YOU

Editor's Notes

  • #6 [1] http://e27.co/worlds-data-volume-to-grow-40-per-year-50-times-by-2020-aureus-20150115-2/