BIG DATA ANALYTICS
Contents
• What is Big Data
• Characteristics of Big Data
• Structure of Big Data
• Big Data approaches
• Issues in legacy systems
• Hadoop
• Big Data analytics
• Types of Big Data Analytics
• Analytics life cycle
• Analytics tools
What is Big Data?
“Big Data” is similar to “small data”, but bigger in size. Because the data is bigger, it requires different approaches, techniques, tools, frameworks and architectures.
It is the technology that deals with large and complex datasets that vary in format and structure and do not fit into memory.
Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques. It finds new insights in existing data and provides guidelines for capturing and analyzing future data.
Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand Tweets, and upload 200,000 photos to Facebook.
Old model: a few companies generate data, and everyone else consumes it.
New model: all of us generate data, and all of us consume it.
Characteristics of Big Data: the 5 Vs
• Volume: data at scale
• Velocity: data in motion
• Variety: data in many forms
• Veracity: data uncertainty
• Value: data in Map Reduce
1st Volume
Volume refers to the vast amounts of data generated every second. We are no longer talking about terabytes but zettabytes.
If we take all the data generated in the world between the beginning of time and 2000, the same amount of data will soon be generated every minute.
The volume of data was projected to grow 44-fold between 2009 and 2020, from 0.8 zettabytes to 35 zettabytes.
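The growth figures above can be checked with a few lines of arithmetic. The sketch below uses only the two numbers quoted on the slide (0.8 ZB and 35 ZB). The compound annual growth rate is a derived illustration that assumes steady year-on-year growth, which the slide does not claim.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009

# Total multiplication of data volume over the whole period.
growth_factor = end_zb / start_zb

# Implied compound annual growth rate, ASSUMING steady growth each year.
annual_rate = growth_factor ** (1 / years) - 1

print(f"Total growth: {growth_factor:.2f}x (rounded to 44x on the slide)")
print(f"Implied annual growth: {annual_rate:.1%}")
```

Under that assumption, a 44-fold increase over eleven years corresponds to the data volume growing by roughly two fifths every year.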
2nd Velocity
Velocity refers to the speed at which new data is generated and the speed at which data moves around.
Just think of:
• Social media messages going viral in seconds
• High-frequency stock trading algorithms that reflect market changes within microseconds
• Machine-to-machine processes that exchange data between billions of devices
• Infrastructure and sensors that generate massive log data in real time
• Online gaming systems that support millions of concurrent users, each producing multiple inputs per second
3rd Variety
Variety refers to the different types of data we can now use. In the past we focused only on structured data that fitted neatly into tables or relational databases, such as financial data. In fact, an often-quoted estimate is that 80% of the world’s data is unstructured (text, images, video, voice, etc.).
4th Veracity
Veracity refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of the content).
5th Value
Value refers to how useful the data is to us. We have access to big data, but unless we can turn it into value it is useless. It can easily be argued that ‘value’ is the most important V of Big Data.
Map Reduce
Map Reduce is a programming model and processing technique for distributed computing. Hadoop's implementation of it is written in Java.
Hadoop is the most popular implementation of Map Reduce because it is easily available: it is an entirely open-source platform for handling Big Data.
Structure of Big Data
Structured:
• Most traditional data sources
Semi-structured:
• Many sources of big data
Unstructured:
• Video data, audio data
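To make the three categories concrete, here is a minimal sketch of how each kind of data is typically read. All sample values (the CSV rows, the JSON records and the tweet-like text) are invented for illustration and do not come from the slides.

```python
import csv
import io
import json
import re

# Structured: fixed columns, as in a relational table or a CSV export.
structured = io.StringIO("id,name,amount\n1,Alice,120.50\n2,Bob,80.00\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing, but fields may differ between records.
semi_structured = '[{"user": "alice", "tags": ["big", "data"]}, {"user": "bob"}]'
records = json.loads(semi_structured)

# Unstructured: free text; any structure has to be extracted afterwards.
text = "Loving the #BigData course, c u at the #Hadoop lab 2moro!"
hashtags = re.findall(r"#(\w+)", text)

print(rows[0]["name"])          # every row has the same columns
print(records[1].get("tags"))   # the second record simply has no "tags" field
print(hashtags)                 # structure recovered from raw text
```

The structured rows can be queried directly, the semi-structured records need checks for missing fields, and the unstructured text needs an extraction step before any analysis is possible.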
Big Data approach (platform)
• Processes any type of data (structured, semi-structured or unstructured)
• Purpose-built engines (designed to handle different requirements)
• Manages and governs data in the ecosystem
• Enterprise data integration
• Grows and evolves on the current infrastructure
Issues in legacy systems
• Limited Storage Capacity
• Limited Processing Capacity
• No Scalability
• Single point of Failure
• Sequential Processing
• RDBMSs can handle only structured data
• Requires preprocessing of data
• Information is collected according to current business needs
Hadoop says he has a solution to our BIG problem!
Apache Hadoop is a framework that allows for the distributed processing of large datasets across clusters of commodity computers using a simple programming model.
Hadoop Approach
HDFS (Hadoop Distributed File System)
• A highly fault-tolerant, distributed, reliable and scalable file system for data storage
• Stores multiple copies of the data on different nodes
• A file is split into blocks that are stored on multiple machines
• A Hadoop cluster typically has a single NameNode and a number of DataNodes
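The block splitting and replication described above can be sketched as a small simulation. This is not the HDFS API. The function names and node names are hypothetical, and the round-robin placement is a simplification (real HDFS uses a rack-aware placement policy). The block size of 128 MB and replication factor of 3 are the usual HDFS defaults.

```python
import math

BLOCK_SIZE_MB = 128   # usual HDFS default block size
REPLICATION = 3       # usual HDFS default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; only the last block may be smaller."""
    if file_size_mb <= 0:
        return []
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * (n_blocks - 1)
    sizes.append(file_size_mb - block_size_mb * (n_blocks - 1))
    return sizes


def place_replicas(n_blocks, data_nodes, replication=REPLICATION):
    """Map each block id to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block_id: [data_nodes[(block_id + r) % len(data_nodes)]
                   for r in range(replication)]
        for block_id in range(n_blocks)
    }


# Hypothetical cluster with four data nodes storing one 300 MB file.
nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), nodes)

print(blocks)
for block_id, holders in placement.items():
    print(f"block {block_id} -> {holders}")
```

The `placement` dictionary plays the role of the metadata kept by the NameNode: it records which DataNodes hold each block, so that losing any single node still leaves two copies of every block.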
Hadoop Approach
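MAP REDUCE
MapReduce is a programming model that processes and analyzes huge datasets in parallel across the nodes of a cluster. The Map step turns each input record into intermediate key-value pairs, the framework sorts and groups those pairs by key, and the Reduce step aggregates each group into a result. In this way ‘bad’ data can be filtered out and only the necessary information is retained.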
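The classic illustration of this model is a word count. The sketch below runs in a single Python process and only imitates the three stages (map, group by key, reduce). It does not use Hadoop, and the function names and sample lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


lines = ["big data is big", "data in motion"]
counts = word_count(lines)
print(counts)
```

In a real cluster, many map tasks would each process one block of the input on the node that stores it, and the grouped pairs would be sent over the network to the reduce tasks. The logic of each stage stays the same.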
ANALYTICS IS IN YOUR BLOOD
Do you realize that you do analytics every day?
I need to get to campus faster!
Hmm.. Looking at the sky today, I think it will rain.
Based on my mid-term and assignment scores, I need to get at least 80 in my final exam to pass this course.
I stalked her social media. I think she is single, because most of her posts are only about food :p
Big Data Analytics
• Examining large amounts of data
• Appropriate information
• Identification of hidden patterns and unknown correlations
• Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue
Example application areas: smarter healthcare, traffic control, manufacturing, quality search
Types of Big Data Analytics
Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in a meaningful way such that, for example, patterns might emerge from the data.
e.g.: Fulan only posts his activity on Facebook at the weekend.
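A descriptive summary like the Fulan example amounts to counting and tabulating what already happened. The sketch below assumes a hypothetical list of post dates (invented for illustration) and summarizes them by day of the week.

```python
from collections import Counter
from datetime import date

# Hypothetical dates on which Fulan posted (invented sample data).
post_dates = [
    date(2024, 3, 2), date(2024, 3, 3), date(2024, 3, 9),
    date(2024, 3, 10), date(2024, 3, 16),
]

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# date.weekday() returns 0 for Monday through 6 for Sunday.
posts_per_weekday = Counter(DAY_NAMES[d.weekday()] for d in post_dates)

# Share of posts made on Saturday or Sunday.
weekend_share = sum(1 for d in post_dates if d.weekday() >= 5) / len(post_dates)

print(posts_per_weekday)
print(f"Share of posts made at the weekend: {weekend_share:.0%}")
```

Nothing is predicted here: the summary only describes past behaviour, which is what separates descriptive from predictive analytics.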
Predictive analytics is the branch of data mining concerned with the prediction of future probabilities and trends.
e.g.: Fulan probably has a job, because he always leaves home at 7 in the morning and gets back at 6 in the evening.
Types of predictive analytics:
• Supervised analytics is when we know the true outcome of something in the past, so past outcomes can be used to predict new ones.
• Unsupervised analytics is when we don’t know the true outcome of something in the past. The result is a set of segments that we need to interpret.
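The supervised case can be sketched with the weather example from the editor's notes: past days with a known weather type are used to predict the type for a new day. The readings below are invented. The method used, k-nearest neighbours, is chosen here only because it fits in a few lines; it is not one of the methods named in the notes (regression, decision tree, SVM, ANN).

```python
import math
from collections import Counter

# Hypothetical history: (temperature in C, humidity in %, cloud density in %) -> known weather.
history = [
    ((31.0, 55.0, 10.0), "sunny"),
    ((33.0, 50.0, 20.0), "sunny"),
    ((27.0, 75.0, 70.0), "cloudy"),
    ((26.0, 80.0, 65.0), "cloudy"),
    ((24.0, 92.0, 95.0), "rain"),
    ((23.0, 95.0, 90.0), "rain"),
]


def predict(features, training_data, k=3):
    """Return the most common label among the k closest past observations."""
    nearest = sorted(training_data,
                     key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


today = (24.5, 90.0, 92.0)   # hypothetical readings for today
print(predict(today, history))
```

The known labels ("sunny", "cloudy", "rain") are what make this supervised: the prediction is a vote among past days whose true outcome was recorded. A real model would also rescale the features so that no single measurement dominates the distance.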
Analytics Life Cycle
Analytics Tools
Thank You


Editor's Notes

  • #21 There are two types of predictive analytics:
    ◦ Supervised: supervised analytics is when we know the truth about something in the past. Example: we have historical weather data, namely the temperature, humidity, cloud density and weather type (rain, cloudy or sunny). We can then predict today's weather from today's temperature, humidity and cloud density. Machine learning methods that can be used: regression, decision tree, SVM, ANN, etc.
    ◦ Unsupervised: unsupervised analytics is when we don’t know the truth about something in the past. The result is a set of segments that we need to interpret. Example: we want to segment students based on their historical exam scores, attendance and lateness history.
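The student segmentation example can be sketched with a tiny k-means clustering written in plain Python. The student records are invented. k-means and the choice of two segments are assumptions made for illustration, since the notes do not name a specific unsupervised method.

```python
import math

# Hypothetical students: (exam score, attendance in %, number of late arrivals).
students = {
    "s1": (88.0, 95.0, 0.0),
    "s2": (92.0, 98.0, 1.0),
    "s3": (85.0, 90.0, 1.0),
    "s4": (55.0, 60.0, 8.0),
    "s5": (48.0, 65.0, 10.0),
    "s6": (60.0, 55.0, 7.0),
}


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, iterations=10):
    # Deterministic start: evenly spaced points serve as the initial centroids.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    labels = [nearest_centroid(point, centroids) for point in points]
    return labels, centroids


labels, centroids = k_means(list(students.values()), k=2)
for name, label in zip(students, labels):
    print(name, "-> segment", label)
```

No weather-style "truth" column is used anywhere: the algorithm returns only segment numbers, and it is up to the analyst to look at the centroids and decide what each segment means, for instance "high score, regular attendance" versus "low score, often late".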