Big Data, Big Deal
For Future Big Data Scientists
Prepared By: Wei-Yen Lin
May, 2013
Outline
• A Buzzword: Big Data
• What Is Big Data: Big Data 101
• What Makes It Happen: Drivers For Big Data
• How To Deal With It: Existing Big Data Technologies
• How To Improve: Challenges For Big Data
• How To Be A Good Big Data Scientist
Big Data, Big Deal 2013  Page 2
Trying To Answer ....
Everyone Is Talking About Big Data...
A Truth: We Already Live In A Big Data World
Units Of Big Data Are Different...
Source: Computer Sciences Corporation, 2012
So, Big Data = Data Is Big?
Origin Of The Term
• First ACM article to use the term (Michael Cox and David Ellsworth, Ames Research Center, 1997):
“…data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data.”
• First definition (Francis Diebold, University of Pennsylvania, 2000):
“Big Data refers to the explosion in the quantity (and sometimes, quality) of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology.”
Commonly Accepted 3 V's of Big Data
Doug Laney, META Group, 2001
Volume, Velocity, Variety: Examples
• Volume – terabytes, records, transactions, tables, files
– A jumbo jet creates 640 TB of data on one Atlantic crossing; multiply that by 25,000 flights flown each day.
• Velocity – batch, near-time, real-time, streams
– Today’s online ad serving requires a decision within 40 ms.
– Financial services need close to 1 ms to calculate customer scoring probabilities.
• Variety – structured, unstructured, semi-structured, and all of the above in a mix
– Walmart processes 1M customer transactions per hour and feeds information to a database estimated at 2.5 PB (petabytes).
– There are old and new data sources such as RFID, sensors, mobile payments, in-vehicle tracking, etc.
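The volume figures above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes decimal units (1 EB = 1,000,000 TB) and treats every one of the 25,000 daily flights as producing as much data as one Atlantic crossing, so the result is an upper bound rather than a measurement. The constant and function names are illustrative.

```python
# Back-of-the-envelope check of the volume figures quoted on the slide.
# Assumption: decimal units, i.e. 1 EB = 1,000,000 TB.
TB_PER_CROSSING = 640              # quoted: data from one Atlantic crossing
FLIGHTS_PER_DAY = 25_000           # quoted: flights flown each day
TRANSACTIONS_PER_HOUR = 1_000_000  # quoted: Walmart customer transactions


def daily_volume_tb(tb_per_flight: int, flights: int) -> int:
    """Upper bound: assumes every flight produces as much data as one crossing."""
    return tb_per_flight * flights


total_tb = daily_volume_tb(TB_PER_CROSSING, FLIGHTS_PER_DAY)
total_eb = total_tb / 1_000_000
transactions_per_second = TRANSACTIONS_PER_HOUR / 3600

print(f"{total_tb:,} TB per day = {total_eb:g} EB per day")
print(f"about {transactions_per_second:.0f} transactions per second")
```

Under these assumptions the flight data alone would reach the exabyte range every day, which is why volume is listed first among the three V's.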
Three Top-Level Elements
Data Management
• Data storage infrastructure, and resources to manipulate it
Data Analysis
• Technologies and tools to analyze the data and glean insight from it
Data Use
• Putting Big Data insights to work in Business Intelligence and end-user applications
Source: Martin Hall, 2011
To Sum Up, Big Data Is …
Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
Characteristic: high-volume, high-velocity, and/or high-variety information assets
Solution: new forms of processing
Goal: enhanced decision making, insight discovery and process optimization
Key Drivers of Big Data Technology Demand
Technical Drivers (1)
Modern science in search of new knowledge
• Scientific experiments and tools are becoming heavily based on data processing
Google and Facebook have driven many advances in Big Data efficiency
• Google handles about 3 billion search queries per day
• Twitter handles some 400 million tweets per day, which amounts to 12 terabytes per day
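To make the daily traffic figures above easier to picture, they can be converted into per-second rates and an implied average size per tweet. This is a rough sketch that assumes traffic is spread evenly over the day and that a terabyte is 10^12 bytes; the variable names are illustrative.

```python
# Convert the quoted per-day figures into per-second rates.
# Assumptions: load is uniform over the day; 1 TB = 10**12 bytes.
SECONDS_PER_DAY = 24 * 60 * 60

QUERIES_PER_DAY = 3_000_000_000    # quoted: Google search queries per day
TWEETS_PER_DAY = 400_000_000       # quoted: tweets per day
TWEET_BYTES_PER_DAY = 12 * 10**12  # quoted: 12 TB of tweet data per day

queries_per_second = QUERIES_PER_DAY / SECONDS_PER_DAY
tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY
bytes_per_tweet = TWEET_BYTES_PER_DAY / TWEETS_PER_DAY

print(f"about {queries_per_second:,.0f} queries per second")
print(f"about {tweets_per_second:,.0f} tweets per second")
print(f"about {bytes_per_tweet / 1000:g} KB stored per tweet")
```

The implied size per tweet is far larger than the visible text, which suggests the 12 TB figure counts metadata and related data as well; the slide does not say which.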
Key Drivers of Big Data Technology Demand
Technical Drivers (2)
Data collected and stored continues to grow exponentially
• The McKinsey Quarterly: the demand for storage has grown more than 50% annually in recent years
Data is increasingly everywhere and in many formats
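A growth rate of 50% per year compounds quickly. The sketch below works out the doubling time and a five-year projection, assuming the rate stays constant at exactly 50% (the slide says "more than 50%", so these are conservative figures). The function names are illustrative.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate (0.5 = 50%)."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_demand(current: float, annual_growth: float, years: int) -> float:
    """Demand after `years` of compound growth at a constant rate."""
    return current * (1 + annual_growth) ** years


print(f"doubling time: {doubling_time_years(0.5):.2f} years")
print(f"demand after 5 years: {projected_demand(1.0, 0.5, 5):.2f}x today's level")
```

At that rate storage demand doubles in well under two years, so capacity planned for today's volume is outgrown quickly.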
Key Drivers of Big Data Technology Demand
Business Drivers (1)
Traditional data-intensive industries
• Genomic research, drug development, healthcare
• High-tech industry, CAD/CAM, weather/climate, etc.
Business (retail) uses Big Data technologies “to search” for customers
• Delivering directly to customers requires prediction of customer behavior
Key Drivers of Big Data Technology Demand
Business Drivers (2)
Consumer products and services delivery
• Captures user preferences and makes recommendations based on previous records
The rise of public opinion stored in platforms
Social media
• Managing public campaigns, e.g. elections, integrated public relations
Big Data Techniques
A Few Examples
• Supervised learning – Support Vector Machine
• Unsupervised learning – Cluster Analysis
• Data fusion – Signal Processing, Natural Language Processing
• Optimization – Genetic Algorithms, Neural Networks
• Predictive modeling – Regression, Time Series Analysis
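As a small illustration of the predictive modeling entry, here is a minimal sketch of simple linear regression fitted by ordinary least squares in plain Python. The data is made up for the example, and `fit_line` and `predict` are hypothetical helper names; a real project would more likely use a statistics package such as R.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up example data that lies exactly on y = 2x + 1.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(model, predict(model, 5))
```

The model is learned from past observations and then applied to an unseen input, which is the pattern every technique in the list follows in some form.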
Big Data Technologies
Where is processing hosted?
– Distributed servers/cloud (e.g. Amazon EC2)
Where is data stored?
– Distributed storage (e.g. Hadoop Distributed File System, HDFS)
What is the programming model?
– Distributed processing (e.g. MapReduce)
How is data stored and indexed?
– High-performance schema-free database (e.g. Cassandra)
What operations are performed?
– Data analytics, semantic processing (e.g. R)
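The MapReduce programming model named above can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce steps using word counting; it does not use Hadoop or any real cluster API, and the function names are chosen for the example only.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence inside one process."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big deal", "data is big"])
print(counts)
```

In a distributed setting the map and reduce calls run on many machines and the shuffle moves data across the network, but the contract between the three phases stays the same.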
Big Data Landscape
Source: Forbes, 2012
From Data Mining To Big Data Mining
Source: Robert J. Abate, 2012
The Life Cycle Of Big Data Method Should Be ...
Source: Robert J. Abate, 2012
Challenges For Big Data
Data
Data quality
• How to find high-quality data in vast collections of data? How good is the data? How broad is the coverage?
Data comprehensiveness
• Are there areas without coverage? What are the implications?
Data reliability and validity
• How to determine the quality of data sets and their relevance to particular issues?
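Two of the questions above, "how good is the data?" and "are there areas without coverage?", can be given a first rough answer with simple profiling. The sketch below is one possible approach, not a standard method: it measures the share of non-missing values per field and lists expected categories that never appear. The records, field names and helper names are made up for illustration.

```python
def completeness(records, fields):
    """Fraction of records with a non-missing value, per field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }


def coverage_gaps(records, field, expected_values):
    """Expected values of `field` that never occur in the records."""
    seen = {r.get(field) for r in records}
    return sorted(set(expected_values) - seen)


# Hypothetical sample data.
records = [
    {"region": "north", "sales": 10},
    {"region": "south", "sales": None},
    {"region": "north", "sales": 7},
    {"region": "", "sales": 3},
]

print(completeness(records, ["region", "sales"]))
print(coverage_gaps(records, "region", ["north", "south", "east", "west"]))
```

Numbers like these only flag where to look; whether a gap matters for a particular question still needs a judgment about its implications.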
Challenges For Big Data
Processing
Data mining/data intelligence algorithms
• To handle/discover new data structures and multi-type data relations
• To respond to specific use cases and operations over data
Data interpretation
• Understand the output and model it through some form of simulation. Domain experts must continue to play a role. We must be wary of becoming too beholden to the numbers.
Challenges For Big Data
Management
Infrastructure support for storing and moving data, and for on-demand processing
• Is cloud computing the right technology? Are there alternatives?
• High-speed network infrastructure, on-demand provisioning
• To respond to specific use cases and operations over data
Security, trustworthiness and data-centric security
• Much of this information is about people. How can we extract enough information to help people without extracting so much as to compromise their privacy?
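One way to make the privacy question concrete is a k-anonymity check, a technique the slides do not name and which is used here only as an assumed example. The sketch tests whether every combination of indirectly identifying fields is shared by at least k records. The records, field names and helper names are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the rarest combination of quasi-identifier values (0 if no records)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical records: age band and ZIP prefix could identify a person.
records = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "941", "diagnosis": "A"},
]

print(smallest_group_size(records, ["age_band", "zip3"]))
print(is_k_anonymous(records, ["age_band", "zip3"], k=2))
```

In this made-up data one person is alone in their group, so releasing the table as-is would single them out; coarsening or suppressing fields trades analytical detail for privacy.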
Big Data Talent
Three Groups: Deep Analytical, Big Data Savvy, Supporting Tech.
Source: U.S. Bureau Of Labor Statistics, McKinsey
Qualities Of Data Scientists
Advice from DJ Patil, one of “The World's 7 Most Powerful Data Scientists” (Forbes)
• Technical expertise: deep expertise in some scientific discipline.
• Curiosity: a desire to go beneath the surface.
• Storytelling: the ability to use data to tell a story and to communicate it effectively.
• Cleverness: the ability to look at a problem in different, creative ways.
