random notes on big data
Chen Peng, Jianqiang Wang, Yang Huang
April 19, 2013
What is big data
● Volume: Gigabytes -> Terabytes -> Petabytes.
● Velocity: time-sensitive, streaming, real-time.
  Jet engine: 20TB/hr
  GE: (minds + machines)
● Variety: structured/unstructured.
● Value: insights, analytical systems.
Challenges: collect, store, organize, analyze and share
External
> web sites (blogs/reviews)
> social media (Facebook, LinkedIn, Google+, Twitter)
> images and videos
> ...
Internal
> transactions
> server logs
> machines and sensors
> emails
> ...
Variety
Value Hierarchy
Raw Data
Normalized
Insight
Recommendation
Transact
Data is now a strategic asset
Technology stack & corresponding firms
Google Cloud Platform
● Google App Engine: scalable application development and execution environment
● Google Compute Engine: virtual machines; run arbitrary workloads at scale (e.g. Hadoop, scientific computing)
● Google Cloud Storage: storage; connecting glue between each step of the data pipeline
● Google BigQuery: data analysis; querying large datasets + third party apps for visualization (e.g. Tableau)
Big data analytics
Analytics is the scientific process of transforming data into insights for making better decisions.
Data ("resource") -> Insight ("product") -> Decision ("goal")
● Data: IT logs, cloud, social media, sensors, experiments, etc.
● Data to insight: statistical & operations research modeling
● Insight to decision: judgement, constraints, intuition
Predictive analytics extracts information from data and uses it to predict future trends and behavior patterns.
regression models
discrete choice models
time series models
classification models (decision tree, random forest, support vector machine, neural network, etc.)
clustering models (k-means, density based, graph based, etc.)
association analysis
...
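As a small illustration of the first item in the list above, the following sketch fits a one-variable regression by ordinary least squares and uses it to predict the next value. The function name `fit_line` and the monthly sales figures are made up for this example; they are not from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: sales observed in months 1 to 5 (illustrative only)
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

intercept, slope = fit_line(months, sales)
forecast = intercept + slope * 6  # predicted sales for month 6
print(f"slope={slope:.2f} intercept={intercept:.2f} forecast={forecast:.2f}")
```

Real predictive work would normally use a statistics library and hold out data to check the fit; the point here is only the shape of the "learn from past data, then predict" step.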
Big data analytics
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics
Always keep in mind...
> business objectives are the origin of every data mining solution
> data preparation is more than half of the data mining process
> all patterns are subject to change
> there will always be new knowledge
Always pause and ask yourself:
Does this work relate to the business question we try to answer?
Is the original business question still valid?
Use cases by industry
Industry: Use cases/applications
● Healthcare: drug development; patient monitoring; Electronic Medical Records
● Utilities: smart grid optimization
● Retail & marketing: customer loyalty and churn analysis; targeted product and services offerings; product sentiment analysis; marketing campaign optimization
● Financial services: fraud detection & prevention; anti-money laundering
● Telecom: customer churn mitigation; geospatial analytics; call data record (CDR) analysis
Industry applications of big data analytics
Customer acquisition
Predict customers' buying habits in order to promote relevant products at multiple touch points.
http://www.youtube.com/watch?feature=player_embedded&v=3WspJ16Ubhw
Clinical decision support
Experts use predictive analysis in health care primarily to determine which
patients are at risk of developing certain conditions, like diabetes, asthma, heart
disease, and other lifetime illnesses.
Cross sale
Predictive analytics can help analyze customers' spending, usage and other behavior, leading to efficient cross sales, i.e. selling additional products to current customers (e.g. beer & diapers).
Ads targeting
http://www.slideshare.net/dennyglee/yahoo-tao-case-study-excerpt
Fraud detection
A predictive model can help weed out the "bads" and reduce a business's
exposure to fraud.
Image and Speech Recognition
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/MIT_BigData_Sep2012.pdf
Operations
Jet Engine + Humans
http://www.youtube.com/watch?v=JHc4ZTTWKrQ
Industry applications of big data analytics
Amazon warehouse operational efficiency: http://www.youtube.com/watch?v=Kafs9tZskuo
Beer and diaper
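The beer-and-diaper story is usually told as an association analysis result. A minimal sketch of the three standard rule metrics (support, confidence, lift) for the rule "diapers -> beer" is shown below; the baskets and the helper name `rule_metrics` are invented for illustration and do not come from any real retail data.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_c = sum(1 for b in baskets if consequent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = n_both / n            # share of baskets containing both items
    confidence = n_both / n_a       # share of antecedent baskets that also hold the consequent
    lift = confidence / (n_c / n)   # above 1 means bought together more often than by chance
    return support, confidence, lift


# Hypothetical shopping baskets (illustrative only)
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

With so few baskets the numbers mean little; on real transaction logs the same counts would be computed over millions of baskets and many candidate item pairs.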
What are those startups doing?
Bloomreach
http://www.youtube.com/watch?feature=player_embedded&v=K12awAj4tW8
Datastax
http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html?pagewanted=all
Paraccel
http://www.paraccel.com/solutions/paraccel-solutions-big-data.php#.UXG207WG3Ct
Kaggle
http://www.kaggle.com/c/acm-sf-chapter-hackathon-big
VC funding for "Big Data"
Data from 71 start-ups. Funding is counted starting from 2004.
VC Funding Activity
Data from 71 start-ups. Funding is counted starting from 2004.
Interesting view points
"Special (domain) knowledge becomes less relevant; organizations should focus on collecting people who know how to extract value and insights from data."
"In God we trust. All others must bring data."
"The usefulness of a variable in a model is inversely related to the time you spend creating it."
"Noise is convex but information is concave."
"Big data is sexy but small data is beautiful."
[Figure: noise and information plotted against data size]
Interesting view points
"All models are wrong, but some are useful."
"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it; everyone thinks everyone else is doing it, so they claim they are doing it."
"Statistics: The Art and Science of Learning from Data"
The danger of big data
Open discussion
Potential opportunities / challenges for entrepreneurs?
- visualization
- internet of things
- analytics as a service (a3s)
Standardization vs. customization
Human and data interaction
- data vs. intuition
Back-Up Slides
Data Science vs. OR
[Figure: chart with axes "Risk" and "Measurable of Objective"; regions labeled risk management, strategic planning, predictive analytics, optimization]
skill sets of data scientists
Big data types
● Web & social media: clickstream, web content, amazon reviews, facebook postings & 'like'...
● M2M: smart meters, oil rig sensor reading, GPS signals...
● Transaction: retail store, healthcare claims, utility billing...
● Biometrics: fingerprint, face, voice, handwriting...
● Human-generated data: call logs, emails, surveys...
Web & social media
● Transaction: orders, revenue
● Conversion: click-through, convert to purchase,...
● Session: length, bounce rate
● Lifetime value: repeat, frequency,...
● Social interaction: intensity, influence,...
Shopping cart analysis
CTR prediction
Personalization
Retention/customer churn
A/B testing
Targeted ads
Lifetime value
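To make the conversion metric and A/B testing items above concrete, here is one common way to compare two conversion rates: a pooled two-proportion z-test. This is a sketch of a standard textbook test, not necessarily what any of the firms mentioned use; the function name `ab_test` and the visitor counts are hypothetical.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided, pooled two-proportion z-test on conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value


# Hypothetical experiment: 1000 visitors per variant (illustrative only)
p_a, p_b, z, p_value = ab_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"A={p_a:.3f} B={p_b:.3f} z={z:.2f} p={p_value:.4f}")
```

A small p-value suggests the difference in conversion is unlikely to be chance alone, but the usual caveats apply: fix the sample size in advance and avoid peeking repeatedly at the result.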
Interesting data visualization projects
wind map
http://hint.fm/wind/gallery/oct-30.js.html
Some analytical problems people deal with at Google ...
● search ranking
Processing Pipeline
[Diagram: log, sensor, web, ... -> Hadoop MapReduce -> Structured Data -> SQL, HIVE, Dremel, ... -> Analytics]
Note: Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports running applications on large clusters of commodity hardware. It was inspired by Google's MapReduce and further developed and promoted by Yahoo.
Big Data
Cloud Computing
http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
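The MapReduce step in the processing pipeline above can be sketched in a few lines of single-machine Python. This is only an illustration of the programming model (map, group by key, reduce); it does not use Hadoop's API, and the function names and log lines are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one log line."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


# Hypothetical raw log lines (illustrative only)
records = ["error disk full", "warning disk slow", "error network down"]

mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many machines and the shuffle moves data across the network; the structured output (here, `counts`) is what later SQL-style tools would query.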
How big is big?
When your data set becomes so large that you have to start innovating around how to collect, store, organize, analyze and share it ...
External
> web sites (blogs/reviews)
> social media (Facebook, LinkedIn, Google+, Twitter)
> images and videos
> ...
Internal
> transactions
> server logs
> machines and sensors
> emails
> ...
[Table: industries (rows) by data type (columns); column placement of individual cells is not recoverable]
Data types: Web & Social media, M2M, Transaction, Biometrics, Human-generated
● Health care: sentiment analysis; patient monitoring; genetic testing; Electronic Medical Records
● Utilities: smart meters
● Retail: loyalty programs; RFID tags; recommendation, market basket; face recognition
● Telcos: customer churn; location-based
● IT: machine log
Example of semantic graph
Call Data Record
What is Hadoop
