Big Data Analytics
& Trends
Presentation
by
Dr.K.Sreenivasa Rao
Dept. of CSE, VBIT
Content
1. What is Big Data?
2. Why Big Data?
3. Some Definitions
4. Types of Data: Structured, Unstructured & Semi-structured
5. The Data Landscape
6. Some Other Definitions
7. Characteristics of Big Data
8. Data Generation Points
9. Big Data Analytics
10. Example Scenario
11. Challenges of Big Data
12. Hadoop, History & Complementary Packages
13. Difference between Big Data & Data Science
14. Salary Trends in Hadoop/Big Data
What is Big data?
• Facebook generates 10 TB of data daily
• Twitter generates 7 TB of data daily
• IBM claims 90% of today’s stored data was generated in just the last two years.
Why Big Data?
• Big data has grown because of
– Increased storage capacity
– Increased processing power
– Availability of data (in many different data types)
– Every day we create about 2.5 quintillion bytes of data, i.e. 2.5 exabytes or 2.5 million TB (1 quintillion bytes = 1 exabyte = 1000 petabytes, where 1 petabyte = 1000 TB)
• Facebook generates 10 TB of data daily
• Twitter generates 7 TB of data daily
• IBM claims 90% of the data in the world today was created in just the last two years.
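The unit conversion behind the daily-volume figure can be checked with a few lines of Python. This is only a sketch: it assumes the decimal (SI) units stated on the slide (1 PB = 1000 TB, 1 EB = 1000 PB), and the variable names are illustrative.

```python
# Decimal (SI) storage units, as stated on the slide.
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15
BYTES_PER_EB = 10**18

# 2.5 quintillion bytes, the daily figure quoted above.
daily_bytes = 25 * 10**17

daily_tb = daily_bytes // BYTES_PER_TB   # terabytes per day
daily_pb = daily_bytes // BYTES_PER_PB   # petabytes per day
daily_eb = daily_bytes / BYTES_PER_EB    # exabytes per day

print(f"{daily_tb:,} TB = {daily_pb:,} PB = {daily_eb} EB per day")
```

So "2.5 quintillion bytes", "2.5 exabytes" and "2.5 million TB" are three ways of writing the same quantity.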
Some Definitions
• Big data is a "catch-all" term for the power of using a lot of data to solve problems. Big data is data that is large and complex enough that it becomes difficult to process using a single computer.
• Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different formats.
Some Definitions
• Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
[Ref: Strata + Hadoop World 2016: Hadoop and Spark in spotlight]
RDF-Resource Description Framework
Some Other Definitions
• Gartner defines Big Data as high volume, velocity and
variety information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making.
• Big data is often characterized by 3Vs: the
extreme volume of data, the wide variety of data types
and the velocity at which the data must be processed.
Although big data doesn't equate to any specific
volume of data, the term is often used to describe
Terabytes, Petabytes and even Exabytes of data
captured over time.
Characteristics of Big data
Volume: (Data Quantity)
• Twitter generates about 80 MB per second.
• Facebook generates 10 TB of data per day.
• Black box data: a single flight generates nearly 10 TB of data every half hour.
Velocity: (Data Speed) eBay analyzes 5 million transactions per day.
• Velocity refers to the speed at which big data must be analyzed. It matters all the more as big data analysis expands into fields like machine learning and artificial intelligence, where analytical processes mimic perception by finding and using patterns in the collected data.
Variety: (Data Types) Big data includes data from e-commerce sites, health care, education, stock exchanges, banking, etc.
Varying in Time:
• [http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data]
• http://www.information-management.com/news/big-data-analytics/the-
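The two Twitter figures quoted in this deck (about 80 MB per second, and 7 TB daily) can be cross-checked against each other. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and a constant rate over the whole day; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 10**6  # decimal units: 1 TB = 1,000,000 MB


def daily_volume_tb(rate_mb_per_s):
    """Convert a sustained rate in MB/s into a daily volume in TB."""
    return rate_mb_per_s * SECONDS_PER_DAY / MB_PER_TB


twitter_tb_per_day = daily_volume_tb(80)
print(f"{twitter_tb_per_day:.2f} TB/day")
```

At 80 MB per second the daily total comes to a little under 7 TB, so the per-second and per-day figures agree.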
Data generation Points Examples
Mobile Devices
Readers/Scanners
Science facilities
Microphones
Cameras
Social Media
Programs/ Software
Big Data Analytics
• Examining large amount of data
• Appropriate information
• Identification of hidden patterns, unknown correlations
• Competitive advantage
• Better business decisions: Strategic and Operational
• Effective marketing, customer satisfaction, increased
revenue
Example Scenario
You need to go through articles, pictures and videos, links to Facebook and Twitter, etc.
Even after looking at pictures, reading articles and watching videos, you still have no clarity.
Such big data has to be sorted, filtered and analyzed to produce useful information for decision making.
Perhaps Facebook can better help you identify the best gym equipment for your office.
Finally, analytics gives us useful insight or information from big data.
Challenges of big data:
• Problem: reading 1 TB of data from a hard drive
• Sol1: 1 machine with 4 I/O channels of 100 MBps each
• 1 TB = 1024 × 1024 MB
• = 1,048,576 MB
• = about 10,486 seconds at 100 MBps
• = 174.76 minutes with 1 I/O channel
• = 174.76 / 4
• = 43.7 minutes with 4 I/O channels
• Sol2: If 10 such machines read in parallel, it takes about 43.7 / 10 = 4.37 minutes to read 1 TB of data.
• i.e. to analyze big data we first need to read it; today the challenge is I/O speed, not storage capacity.
• The challenge is to read/write the data, not to store it.
• Hadoop is a framework to solve the above challenges.
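The read-time arithmetic above can be reproduced as a small calculation. This is a minimal sketch under the slide's assumptions (1 TB = 1024 × 1024 MB, 100 MBps per I/O channel) plus one more that the slide implies but does not state: reads scale perfectly across channels and machines. The function and parameter names are illustrative.

```python
MB_PER_TB = 1024 * 1024   # binary units, as used on the slide
CHANNEL_MB_PER_S = 100    # throughput of one I/O channel


def read_minutes(size_tb, channels_per_machine, machines=1):
    """Minutes needed to read size_tb, assuming ideal parallel scaling."""
    total_mb = size_tb * MB_PER_TB
    aggregate_rate = CHANNEL_MB_PER_S * channels_per_machine * machines
    return total_mb / aggregate_rate / 60


one_channel = read_minutes(1, 1)
four_channels = read_minutes(1, 4)
ten_machines = read_minutes(1, 4, machines=10)

print(f"1 channel:   {one_channel:.2f} min")
print(f"4 channels:  {four_channels:.2f} min")
print(f"10 machines: {ten_machines:.2f} min")
```

Real clusters fall short of ideal scaling because of network and coordination overhead, so these numbers are best read as lower bounds.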
Hadoop
• Hadoop is an open-source, Java-based programming framework that supports the processing of large datasets in a distributed computing environment. It is an Apache project sponsored by the Apache Software Foundation.
• It is designed to answer the question “How can big data be processed at reasonable cost and in reasonable time?”
• Definition 2:
• Apache Hadoop is a framework for distributed processing of large datasets across clusters of commodity computers/hardware using a simple programming model (MapReduce).
• Commodity hardware means many cheap machines, rather than a small number of high-cost, high-end servers or supercomputers.
• Who uses Hadoop?
• India's Aadhaar scheme uses Hadoop.
• Yahoo
• Facebook, etc.
• History:
• Hadoop's design follows Google's published papers on the Google File System and MapReduce.
• It was created by Doug Cutting and Mike Cafarella around 2005 and developed heavily at Yahoo from 2006.
• It is now Apache Hadoop, a top-level Apache project.
• Some public cloud services that offer Hadoop:
• Amazon EMR (Elastic MapReduce)
• Amazon EC2/S3
• Google Cloud Dataproc
Hadoop Components:
• 1. HDFS (Hadoop Distributed File System): stores data across thousands of servers to achieve high aggregate bandwidth.
• 2. MapReduce: provides a programming model for large-scale distributed processing: mapping data and then reducing it to a result.
• Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
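The "map, then reduce" idea can be illustrated with a word count written in plain Python. This is a single-process sketch of the programming model only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are made up for illustration, and a real Hadoop job would run the map and reduce steps on many machines over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Every line is mapped independently, which is what lets a cluster
    # spread the map step across many machines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data needs big clusters", "Hadoop processes big data"]
counts = word_count(lines)
print(counts)
```

Because each map call sees only its own line and each reduce call sees only one key, both steps can be distributed without the workers having to talk to each other.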
Complementary software packages:
• The term Hadoop has come to refer not just to the base modules above, but also to the collection of additional software packages that can be installed on top of or alongside Hadoop, such as
• Apache Pig,
• Apache Hive,
• Apache HBase,
• Apache Phoenix,
• Apache Spark,
• Apache ZooKeeper,
• Cloudera Impala,
• Apache Flume,
• Apache Sqoop,
• Apache Oozie,
• Apache Storm.
• HBase: an open-source, non-relational distributed database.
• Hive: a data warehouse that provides data summarization and querying.
• Pig: a high-level platform for creating programs that run on Hadoop.
• Apache Spark: a fast engine for big data processing, capable of streaming and supporting SQL, machine learning and graph processing. One survey says 80% of Hadoop projects are going to mature in 2016, and people are looking towards Apache Spark for their next projects.
Types of tools used in Big Data
• Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)
• Where is data stored?
– Distributed storage (e.g. Amazon S3)
• What is the programming model?
– Distributed processing (e.g. MapReduce)
• How is data stored & indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
– Analytic / semantic processing
Difference between Big data & Data Science.
• [http://www.kdnuggets.com/2015/07/data-science-big-data-different-beasts.html]
• Creating an artifact from ore requires tools, craftsmanship and science. The same is true of big data and data science; here we present the distinguishing factors between the ore and the artifact.
• Data Science looks to create models that capture the
underlying patterns of complex systems, and codify those models into
working applications. Big Data looks to collect and manage large
amounts of varied data to serve large-scale web applications and vast
sensor networks.
• Although both offer the potential to produce value from data, the fundamental difference between Data Science and Big Data can be summarized in one statement: Collecting Does Not Mean Discovering.
Investments in data-focused activities center around
tools instead of approaches. The engineering cart
gets put before the scientific horse, leaving an
organization with a big set of tools, and a small
amount of knowledge on how to convert data into
something useful.
So, Data Science is the expertise of converting data into useful information and products that answer the ever-changing demands of the market.
Salary Trends for Big Data/Hadoop
• 1. Average Big Data salaries have increased by 9.3% in the last 12 months. The current salary range is $119,250 to $168,250.
• 2. A Hadoop developer making $120,000 will be valued by competitor companies at $155,000. That's a 29% hike.
• 3. On average, a new Big Data/Hadoop technology is released every 6 weeks, so make sure you stay updated.
• 4. The average salary for a Hadoop developer in San Francisco, CA is $139,000.
• 5. A senior Hadoop developer in San Francisco, CA can earn over $178,000 on average.
• 6. Hortonworks, Paxata and Bloomberg LP are hiring top Big Data/Hadoop talent at the highest pay packages.
• 7. The states with the most Hadoop/Big Data jobs are California, New York, New Jersey and Texas.
So, make sure, you stay updated
Future of Big Data
• $15 billion has been spent on software firms specializing only in data management and analytics.
• This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.
• In February 2012, the open source analyst firm Wikibon released the first market forecast for Big Data, listing $5.1B revenue in 2012 with growth to $53.4B in 2017.
• The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
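The growth figures above are compound-growth claims, and a short calculation shows what they imply. This sketch assumes a constant annual growth rate across each period, which the cited forecasts do not necessarily claim; the function names are illustrative.

```python
def compound_factor(annual_rate, years):
    """Total growth factor after compounding annual_rate for years."""
    return (1 + annual_rate) ** years


def cagr(start, end, years):
    """Compound annual growth rate that takes start to end in years."""
    return (end / start) ** (1 / years) - 1


# 40% per year from 2009 to 2020 (11 years of compounding).
volume_factor = compound_factor(0.40, 2020 - 2009)

# Wikibon forecast: $5.1B in 2012 growing to $53.4B in 2017.
market_cagr = cagr(5.1, 53.4, 2017 - 2012)

print(f"Data volume grows about {volume_factor:.1f}x over 11 years")
print(f"Implied market growth is about {market_cagr:.0%} per year")
```

Under that assumption, 40% a year compounds to roughly 40x over 11 years, the same order as the quoted 44x, and the Wikibon forecast corresponds to revenue growing by around 60% a year.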
• So, data science as a career goal will improve a graduate's employability in the future market.
• Big data Market Forecast
References
• www.Slideshare.com
• www.wikipedia.com
• www.computereducation.org
• Strata + Hadoop World 2016: Hadoop and Spark in spotlight
• http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
• http://www.information-management.com/news/big-data-analytics/the-top-5-trends-in-big-data-for-2017-10029956-1.html
• Books: Big Data by Viktor Mayer-Schonberger