Arun Kumar
MSc (Computer Science),
Don Bosco College, Yelagiri Hills.
10/3/2018 Don Bosco College, Yelagiri hills.
Outline
 Big Data: An Introduction
 Big Data Analytics
 Big Data Analytics: Applications and Business Prosperity
 Big Data Technology
 Big Data: Issues and Challenges
 Conclusion
Big Data:
An Introduction
Introduction
 Data
 Facts and pieces of information collected together
for reference or analysis
 Information processed or stored by computers and
other electronic devices
 Text, image, audio, video, etc.
Introduction
 Big data is still data, but it does not behave the
same way
 The term ‘big data’ applies to information that cannot be
processed or handled using traditional processes or tools
Units of data size:
 1 byte = 8 bits
 1 kilobyte = 1024 bytes
 1 megabyte = 1024 kilobytes
 1 gigabyte = 1024 megabytes
 1 terabyte = 1024 gigabytes
 1 petabyte = 1024 terabytes
 1 exabyte = 1024 petabytes
 1 zettabyte = 1024 exabytes
 1 yottabyte = 1024 zettabytes
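To make the unit ladder concrete, here is a minimal Python sketch that expresses a byte count in the largest fitting unit, using the factor of 1024 between neighbouring units. The function name `human_size` and the sample values are illustrative assumptions, not part of the original slides.

```python
# Unit names in ascending order; each is 1024 times the previous one.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def human_size(n_bytes):
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}s"
        value /= 1024


print(human_size(8 * 1024**4))  # a count of exactly eight terabytes
print(human_size(2.5e18))       # 2.5 quintillion bytes
```

With these binary units, 2.5 quintillion bytes comes out at a little over two exabytes.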
Definition
 There is no single standard definition.
 “Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective,
innovative forms of information processing for enhanced
insight and decision making.” - Gartner
 “Big data exceeds the reach of commonly used hardware
environments and software tools to capture, manage,
and process it within a tolerable elapsed time for its user
population.” - Teradata Magazine article, 2011
Introduction
 Characteristics of Big Data
[Figure: the three V's of Big Data: Volume, Velocity, Variety]
Introduction
 Characteristics of Big Data.
 Volume:
 Huge amounts of data (terabytes to petabytes) at rest.
 Velocity:
 Data in motion (streaming data).
 Variety:
 Many kinds of data (image, audio, text, video, etc.).
Introduction
 Characteristics of Big Data
 Now researchers include more V’s
 Veracity
 Value
 Variability
...
 Victory
Volume
Variety
Velocity
Sources of Big Data
 What is big data?
 Every day, we create 2.5 quintillion bytes of data
— so much that 90% of the data in the world today has been created
in the last two years alone.
 Data comes from everywhere:
 sensors used to gather climate information
 posts to social media sites
 digital pictures and videos
 purchase transaction records
 cell phone GPS signals, etc.
 This data is big data.
Sources of Big Data
 Web & Ecommerce
 Bank/Credit card transactional
 Mobile
 Social
 Video & Preference
 Machine & Sensor
 Retail POS
All of these together become Big Data.
Who is generating big data?
 The model of generating/consuming data has changed
Old model: a few companies generate data; all others consume it
New model: all of us generate data, and all of us consume it
What does Big Data look like?
[Figure: what we know or see versus what's actually there]
Area of Applications
 Health care / Biotech.
 E-Governance.
 Social Networks / Social Media.
 Weather Forecasting.
 Education data.
Area of Applications
 Banking / Insurance / Finance.
 Retail industries.
 CRM / Customer Analytics.
 Airlines, etc.
Big Data Analytics
Definition
 Big data analytics is the process of examining
enormous amounts of data of a variety of types to
uncover hidden patterns, unknown correlations and other
useful information.
 Example:
Searches in “friends” networks at social-networking
sites involve graphs with hundreds of millions of nodes
and many billions of edges.
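As a toy illustration of a “friends” network search, the sketch below runs a breadth-first search over a five-person graph to count the friendship hops between two people. The names, the `FRIENDS` dictionary and the function `degrees_of_separation` are made up for this example; a real social graph with billions of edges would not fit in one in-memory dictionary and would need distributed graph processing.

```python
from collections import deque

# Hypothetical toy network: each person maps to the set of their friends.
FRIENDS = {
    "asha": {"bala", "chitra"},
    "bala": {"asha", "deepak"},
    "chitra": {"asha", "deepak"},
    "deepak": {"bala", "chitra", "esha"},
    "esha": {"deepak"},
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search: number of friendship hops from start to goal."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if person == goal:
            return hops
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None  # no path connects the two people


print(degrees_of_separation(FRIENDS, "asha", "esha"))
```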
Why Is Big Data Analytics Feasible?
 Increased storage capacities
 Next generation products
 Cost Reduction
 Faster and better decision making
 Communication networking
 Improved services or products
 Distributed processing technologies
Stages in Big Data Analytics
Available Analytic Methods
 Traditional Data Processing systems
 Information Processing using statistical tools
 Knowledge Engineering and Intelligence Systems
 Business Analytics using Data mining
 Business Intelligence
 Genetic Algorithms
 Machine learning algorithms
 Exploratory data analysis, etc.
Types of Big Data Analytics
 Descriptive analytics: what has happened?
 Predictive analytics: what will happen?
 Prescriptive analytics: what should happen?
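The three questions can be illustrated on one tiny data set. The sketch below is only an illustration under stated assumptions: the sales figures are invented, the predictive step is a plain least-squares line, and the ordering rule in `units_to_order` is a hypothetical stand-in for a real prescriptive model.

```python
from statistics import mean

# Hypothetical weekly unit sales for one product.
sales = [100, 110, 120, 130, 140]

# Descriptive: what has happened? Summarize the past.
average = mean(sales)


def forecast_next(values):
    """Predictive: fit a least-squares line to the series and extrapolate one step."""
    n = len(values)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(values)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def units_to_order(predicted_demand, in_stock):
    """Prescriptive: turn the forecast into an action (assumed simple rule)."""
    return max(0, round(predicted_demand) - in_stock)


next_week = forecast_next(sales)
print(average, next_week, units_to_order(next_week, in_stock=60))
```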
The Cycle of Big Data Management
Capture -> Organize -> Integrate -> Analyze -> Act
Activities in Analytics
 Analysis of data is a process of
 Inspecting
 Cleaning
 Transforming
 Modeling
data, with the goal of discovering useful information,
suggesting conclusions, and supporting decision-making.
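The four activities can be sketched as one small pipeline. Everything here is assumed for illustration: the readings, the plausible-range limits used for cleaning, and the "model", which is just the mean of the cleaned data.

```python
from statistics import mean

# Hypothetical raw temperature readings in Fahrenheit; None marks a missing value.
raw = [68.0, None, 71.6, 1000.0, 69.8]


def inspect(values):
    """Inspecting: count how many readings are missing."""
    return sum(v is None for v in values)


def clean(values, low=-50.0, high=150.0):
    """Cleaning: drop missing values and readings outside a plausible range."""
    return [v for v in values if v is not None and low <= v <= high]


def transform(values):
    """Transforming: convert Fahrenheit to Celsius."""
    return [(v - 32.0) * 5.0 / 9.0 for v in values]


def model(values):
    """Modeling: the simplest possible model, the mean of the data."""
    return mean(values)


result = model(transform(clean(raw)))
print(inspect(raw), round(result, 1))
```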
Why Are New Analytical Methods Needed?
 Big in size (Volume)
 Unstructured data (Variety)
 Streaming data to analyze (Velocity)
 Distributed data
 Need for parallel analytics
Big Data Technology
Key Technologies for Big data
 DFS (Distributed File System):
 Large files are split into parts
 Move file parts into a cluster
 Fault-tolerant through replication across nodes while being rack-aware
 MapReduce:
Move algorithms close to the data by structuring them for
parallel execution so that each task works on a part of the data. The
power of Simplicity!
 NoSQL:
A NoSQL (often interpreted as Not Only SQL) database
provides a mechanism for storage and retrieval of data that is modeled
in means other than the tabular relations used in relational databases.
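The MapReduce idea above can be sketched in a few lines of plain Python with the classic word-count example. This is a single-process illustration, not Hadoop's API: the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for this sketch, and in a real cluster each map task would run on the node that already holds its chunk of the data.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key, here by summing them."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands for one part of a large file stored on a different node.
chunks = ["big data is big", "data in motion"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each map call looks at only its own chunk, the map step can run in parallel on as many nodes as there are chunks.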
Key Technologies for Big data
Three key technologies that can help to handle big data:
 Information management for big data: Manage data as
a strategic, core asset, with ongoing process control
 High-performance analytics for big data: Gain rapid
insights from big data and the ability to solve increasingly
complex problems
 Flexible deployment options for big data: Choose
between on-premises and hosted, software-as-a-service
(SaaS) approaches
Technologies for Big data
 Fast processors and Massively Parallel Processing
(MPP)
 Distributed File System
 Apache Hadoop
 Data-intensive computing strategies
 Low-cost storage, in-memory processing
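A distributed file system splits large files into parts and replicates each part across nodes. The sketch below shows that idea under simplifying assumptions: the block size, the node names and the round-robin placement are invented for illustration, and rack-awareness is not modeled.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """The file stays readable if every block has a replica on a surviving node."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())


blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(placement)
print(readable_after_failure(placement, "node1"))
```

With three replicas per block, losing any single node still leaves two copies of every block.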
Technologies for Big data
 Hadoop Distributions
 Hortonworks
 Cloud Operating System
 Cloud Foundry: by VMware
 OpenStack: worldwide participation, backed by well-known
companies
 Storage
 Fusion-io: not open source, but very supportive of open
source projects; flash-aware applications.
Development Platforms and Tools
 Python: general-purpose programming language, widely used for data analysis.
 Mahout: machine learning library.
 R: language and environment for statistics and data mining.
 Storm: stream processing, open-sourced by Twitter.
 Giraph: graph processing, used at Facebook.
Databases
 NoSQL Databases
 MongoDB
 Cassandra
 HBase (Hadoop)
 SQL Databases
 MySQL: owned by Oracle
 PostgreSQL: object-relational database
 TokuDB: storage engine that improves RDBMS performance
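The difference between the relational and the NoSQL data model can be seen by storing the same order both ways. This is a sketch under stated assumptions: the relational side uses Python's built-in `sqlite3` module, while the "document store" is only a plain dictionary of JSON strings standing in for a database such as MongoDB; it does not use any real NoSQL client API. The table, keys and values are made up.

```python
import json
import sqlite3

# Relational model: fixed columns, one row per order line.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, item TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "asha", "pen", 2), (1, "asha", "notebook", 1)],
)
total_sql = conn.execute(
    "SELECT SUM(qty) FROM orders WHERE order_id = 1"
).fetchone()[0]

# Document model: the whole order is one nested record with no fixed schema.
# The dict below is a stand-in for a document database (an assumption).
store = {}
store["order:1"] = json.dumps({
    "customer": "asha",
    "items": [{"item": "pen", "qty": 2}, {"item": "notebook", "qty": 1}],
})
order = json.loads(store["order:1"])
total_doc = sum(line["qty"] for line in order["items"])

print(total_sql, total_doc)
```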
Visualization tools
 Maps
 Charts (pie, bar, plot, etc.)
 Graphs
Big Data: Issues & Challenges
Challenges
The bottleneck is...
 In technology
 New architecture, algorithms, techniques are needed
 Also in technical skills
 Lack of experts in using the new technology
[Figure: data sources and Big Data analytics]
Challenges: Internet of Things related
 The amount of data that must be sorted, cleaned, integrated,
analyzed and managed is huge.
 Sensor devices constantly chatter updates about
moisture, light and movement.
 This calls for a real-time stream analytics platform that can
handle Big Data, and a scalable infrastructure to support it.
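Stream analytics processes each reading as it arrives instead of storing everything first. A minimal sketch of that pattern is a sliding-window average; the class name `SlidingAverage`, the window size and the moisture readings are assumptions made for this example.

```python
from collections import deque


class SlidingAverage:
    """Keep the mean of the most recent `size` readings of a stream."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, reading):
        """Add one reading and return the current windowed mean."""
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # this value is about to be evicted
        self.window.append(reading)
        self.total += reading
        return self.total / len(self.window)


# Hypothetical moisture readings arriving one at a time from a sensor.
moisture = SlidingAverage(size=3)
averages = [moisture.update(r) for r in [10.0, 20.0, 30.0, 40.0]]
print(averages)
```

Each update does constant work and keeps only the last few readings in memory, which is what lets the approach keep up with a continuous feed.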
Challenges: Cloud computing related
 Traditional WAN-based transport methods cannot move
terabytes of data at the speed dictated by businesses
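A quick calculation shows why moving terabytes over a WAN is slow. The figures below (10 TB of data, a 100 Mbit/s link, full link utilization) are assumed for illustration and do not come from the slides; the helper `transfer_days` is likewise hypothetical.

```python
def transfer_days(num_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to push num_bytes through a link at the given utilization."""
    seconds = (num_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


# Assumed figures: 10 TB (decimal) over a 100 Mbit/s link, fully utilized.
days = transfer_days(10e12, 100e6)
print(f"{days:.2f} days")
```

Under these assumptions the transfer takes more than nine days, before allowing for protocol overhead or shared bandwidth.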
Classified Issues & Challenges
 Storage
 Management
 Processing
 Visualization
Challenges: Storage related
 There are clearly not enough hard disks/devices.
 Distributed storage alone is not enough; manufacturers
cannot make storage devices fast enough.
 Write speed to devices; need for wider data paths/data buses
Challenges: Management related
 Data collection
 Organizing the varieties of data
 Need for distributed environments
 Need for new analytical methodology
Challenges: Processing related
 Integrating data using Filters
 Deciding “what” data to process and “how”
 Effective design of data processing systems
 Latency and Bandwidth
 Streaming data processing
Challenges: Big data visualization
 Meeting the need for speed
 Understanding the data
 Addressing data quality
 Displaying meaningful results
Conclusion
For Researchers
 Research institutes and companies are seeking more data
scientists for research and development.
 R&D opportunities exist in fields such as
 Telecom industry
 Retail industry
 Social networks
 Healthcare industry and so on.
For Students
 Develop deep analytical skills to land analyst positions
 Build basic knowledge of optimization techniques, data
mining, machine learning algorithms, etc.
 Keep an eye on evolving technologies
Thank you

Big Data analytics
