©2015 Protegra Inc. All rights reserved.
Big Data
Terry Bunio - Protegra
Who Am I?
• Database Administrator
- Oracle
- SQL Server 6, 6.5, 7, 2000, 2005, 2008, 2012
- Informix
- ADABAS
• Data Modeler/Architect
- Investors Group, LPL Financial, Manitoba Blue Cross, Assante
Financial, CI Funds, Mackenzie Financial
- Normalized and Dimensional
• Agilist
- Innovation Gamer, Team Member, SQL Developer, Test writer, Sticky
Sticker, Project Manager, PMO on SAP Implementation
@tbunio
tbunio@protegra.com
agilevoyageur.com
www.protegra.com
Where can you find me?
Members of the Protegra Community of
software-driven businesses & solutions
Definition
Myths
• Big Data Myth #1: It’s Big
• Big Data Myth #2: You need to apply it right away
• Big Data Myth #3: The more granular the data, the better
• Big Data Myth #4: Big Data is good data
• Big Data Myth #5: Big Data means that analysts become all-important
• Big Data Myth #6: Big Data gives you concrete answers
• Big Data Myth #7: Big Data is a magic 8-ball
• Big Data Myth #8: Big Data can create self-learning algorithms
Evolution
Term Time Frame
Decision Support 1970-1985
Executive Support 1980-1990
Online Analytical Processing 1990-2000
Business Intelligence 1989-2005
Analytics 2005-2010
Big Data 2010-2015
Next??
Facts
• Every 2 days we create as much information as we did from the
beginning of time until 2003
• Over 90% of all the data in the world was created in the past 2 years
• It is expected that by 2020 the amount of digital information in
existence will have grown from 3.2 zettabytes today to 40 zettabytes
• The total amount of data being captured and stored by industry
doubles every 1.2 years
• Every minute we send 204 million emails, generate 1.8 million
Facebook likes, send 278 thousand Tweets, and upload 200,000
photos to Facebook
Facts
• Google alone processes on average over 40 thousand search
queries per second, making it over 3.5 billion in a single day
• Around 100 hours of video are uploaded to YouTube every
minute and it would take you around 15 years to watch every
video uploaded by users in one day
• If you burned all of the data created in just one day onto DVDs,
you could stack them on top of each other and reach the moon –
twice
• AT&T is thought to hold the world’s largest volume of data in one
unique database – its phone records database is 312 terabytes in
size, and contains almost 2 trillion rows
Facts
• 570 new websites spring into existence every minute of every
day
• 1.9 million IT jobs will be created in the US by 2015 to carry out
big data projects. Each of those will be supported by 3 new jobs
created outside of IT – meaning a total of 6 million new jobs
thanks to big data
• Today’s data centres occupy an area of land equal in size to
almost 6,000 football fields
• Between them, companies monitoring Twitter to measure
“sentiment” analyze 12 terabytes of tweets every day
Facts
• The amount of data transferred over mobile networks increased
by 81% to 1.5 exabytes (1.5 billion gigabytes) per month between
2012 and 2014. Video accounts for 53% of that total
• The NSA is thought to analyze 1.6% of all global internet traffic –
around 30 petabytes (30 million gigabytes) every day
• The value of the Hadoop market is expected to soar from $2
billion in 2013 to $50 billion by 2020, according to market
research firm Allied Market Research
Facts
• The number of bits of information stored in the digital universe is
thought to have exceeded the number of stars in the physical
universe in 2007
• The boom of the Internet of Things will mean that the number of
devices connected to the Internet will rise from about 13 billion
today to 50 billion by 2020
• 12 million RFID tags – used to capture data and track movement
of objects in the physical world – had been sold by 2011. By
2021, it is estimated that number will have risen to 209 billion as
the Internet of Things takes off
Facts
• Big data has been used to predict crimes before they happen – a
“predictive policing” trial in California was able to identify areas
where crime will occur three times more accurately than existing
methods of forecasting
• By better integrating big data analytics into healthcare, the
industry could save $300 billion a year – that’s the equivalent of
reducing the healthcare costs of every man, woman and child by
$1,000 a year
• Retailers could increase their profit margins by more than 60%
through the full exploitation of big data analytics
Facts
• The big data industry is expected to grow from US$10.2 billion in
2013 to about US$54.3 billion by 2017
What Big Data is and what it isn’t. What
problems does it solve?
What is Big Data not?
• Not regular data
• Not data that fits into the existing analytic paradigm/toolset
• Doesn’t easily fit into the existing row/column structures
New Types of Data
• Activity Data
- Listening to music
- Watching movies
- Browsing
- Driving
- Walking
- Exercising
• Intentional and Non-Intentional
New Types of Data
• Conversations
- Audio
- Visual
- Textual
New Types of Data
New Types of Data
• Image data
- Photo
• Cameras
• Phones
- Video
• Cameras
• CCTV
• YouTube!
New Types of Data
• Machine to Machine
- Cell phone to towers
- Medical devices
• Sensors
- Location
- Speed
- Acceleration
- Health
- Altitude
- Temperature
- Humidity
- Among many others….
New Types of Data
• Internet of Things
- Toasters
- Fridges
- Phones
- Jet Engines
- Combines
- Manufacturing
Abandoned Activities
• Exciting new area of big data
- What mouse patterns and keystrokes are performed but eventually abandoned?
- What articles do you almost comment on?
- What transactions do you start but not complete?
- Do these share common themes?
- At what step is the transaction abandoned?
What makes Data, Big Data?
• Volume
• Velocity
• Variety
• Veracity
Volume
• If we take all the data generated from the beginning of time to 2008
- That same amount of data is now generated almost every minute
• We can now store that data across immense networks
• But there is more data than we can possibly analyze
• One airplane generates 2 terabytes of data about its engines on a
flight across North America
• In 2012, half of the data available for analysis was surveillance video
Big Data versus Lotsa Data
• Lotsa Data
- Same structured data as you currently have
• Just more of it
- Can be analyzed using the same paradigms/toolsets
• May just take a bit longer
Velocity
• The speed at which data is generated and analyzed, and the speed at
which it moves around
- Think of how quickly something can be trending on Twitter
• Technology now allows us to analyze data in memory without it ever
being stored
• 200 Billion tweets per year
• Much prior analytics was applied to static data
- But what happens to your analysis if the data is constantly changing?
• Sensor data
- Can’t store all of it, just too much data
Variety
• We used to be able to just focus on structured data in neat relational
table structures
• 80% of the data is now unstructured
- Text
- Image
- Video
- Voice
- Social Graph data
- NoSQL
• Big Data Technology can now bring different types of data together to
analyze
Variety
• Structured
• Unstructured
• Semi-structured
- XML
- JSON
- Usually applied to text to try and enhance the analysis that can be done
- Stored in NoSQL databases
• MongoDB
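As a minimal sketch of what semi-structured means in practice, the Python lines below parse a small JSON document; the field names and values are invented for illustration, and a document store such as MongoDB would accept the same nested structure without a predefined schema.

import json

# Hypothetical semi-structured record: fields can be nested or missing
# entirely, unlike a fixed relational row.
raw = '{"user": "t_bunio", "likes": ["sql", "agile"], "device": {"type": "phone"}}'

doc = json.loads(raw)
print(doc["user"])                     # t_bunio
print(doc.get("location", "unknown"))  # a missing field is handled gracefully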
Variety
• “Variety is the biggest factor leading companies to Big Data” – Gartner
• Do you need all three factors to make it big data?
- No
- Any two of these factors that cause the existing analysis to fall short can make
the data Big Data
• As long as one is variety
Veracity
• 4th V
- Not mentioned consistently
• Definition
- Messiness or trustworthiness of data
- Involves the lineage of the data
- Twitter versus corporate data
- Is there insightful value in the data?
• 5th V - Validity
Datafication
What Problems does it solve?
• Risk Modelling
- Banks and Insurance
- Credit Card activity
• Customer churn analysis
- What were people doing just before they left?
• Recommendation engines
- LinkedIn, Amazon, Netflix
• Ad targeting
• Aggregated Transactional Analysis
- Supply Chain Management
• Threat Analysis
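A hedged sketch of the “people who bought this also bought” idea behind the recommendation engines listed above: count how often items appear in the same purchase histories and suggest the most frequent companions. The baskets below are invented sample data, not any vendor’s real algorithm.

from collections import Counter
from itertools import combinations

# Invented purchase histories, one set of items per customer.
baskets = [{"book_a", "book_b"}, {"book_a", "book_c"}, {"book_a", "book_b", "book_d"}]

co_counts = Counter()
for basket in baskets:
    for x, y in combinations(sorted(basket), 2):
        co_counts[(x, y)] += 1
        co_counts[(y, x)] += 1

def recommend(item, n=2):
    # Items most often bought alongside the given item.
    scores = {other: c for (first, other), c in co_counts.items() if first == item}
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend("book_a"))  # e.g. ['book_b', 'book_c']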
Who is leading the way?
• LinkedIn
- People you may know
• Netflix
- Million dollar prize
• GE/John Deere
- Predictive servicing on industrial devices
• Amazon
- Books others have also bought
• Google
- For everything
Who is leading the way?
• UPS
• United HealthCare
• Macy’s
• Bank of America
• Citigroup
• Verizon Wireless
• City of Brandon
- Snowclearing website
What can be achieved?
• Cost reductions
- Large Enterprises
• Time reductions
- Large Enterprises and Small Enterprises
• Better decisions
- Large Enterprises, Small Enterprises, and Start-ups
• New Offerings/Product Innovations
- Large Enterprises, Small Enterprises, and Start-ups
Why is it important?
• Big Data is here to stay
• If you don’t embrace it, your competition will. They will use it to:
- Deliver with less cost and faster
- Develop new innovations
- Understand customers better to attract them and prevent them from leaving
- Make quicker and better decisions
What is the technology behind Big Data?
Distributed Data
• Big Data is just too big to be stored on one computer
• Storing data across multiple computers allows you to take advantage
of other computers’ processing power
• SANs were the first solution to distributed data
• Cloud data is now the second generation of the solution
- Amazon S3 – Netflix
- Amazon Glacier
Cloud Computing
• IaaS, PaaS, SaaS, DaaS
• All made possible with virtualization
Hadoop
CAP Theorem
• Consistency
• Availability
• Partition tolerance
• Transactional Data not a good fit for Hadoop
- Requires Consistency
• Behavioural Data is a good fit
- Health Care
- Social Media
Hadoop
• Hadoop was the name of a stuffed elephant that belonged to the son of
one of the developers
• Not a single product
• Collection of applications
• Framework or platform
• Several modules
• Not a database
- Alternative file system with a processing library
Hadoop
• Why use Hadoop?
- Cheaper
- Faster
- Better suited to unstructured data
HDFS
• Hadoop Distributed File System that is spread across many
computers – 100’s to 10,000’s
• Not a database
- These are individual files stored across the computers
• Based on Google’s GFS
- To index the Internet
MapReduce
• Map splits a task into many pieces
- Split a task up and send it to many computers
• Reduce takes the results and combines them back together
• Has been replaced by YARN
- Sometimes called MapReduce2
• Important features of YARN
- Can do stream processing in addition to batch processing
- Can also do Graph processing
MapReduce
• Programming paradigm
- Map component executes a function on each piece of data
• Execute on each node
• Bring the compute to the data
• Output key and value pairs on each node
- Reducer Aggregates the key value pairs on the nodes
• Outputs a combined list
- Mapper and Reducer are classes
• Really a Functional programming model though
• State is not shared
Hello World for MapReduce
• Wordcount
- How much wood could a woodchuck chuck if a woodchuck could chuck
wood?
- {how,1;much,1;wood,2;could,2;a,2;woodchuck,2;chuck,2;if,1}
• We are essentially labelling and counting components of data
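A minimal Python sketch of the same word count, assuming the map and reduce steps run in a single process rather than across a cluster: map emits a (word, 1) pair for every word, and reduce groups the pairs by key and sums the values, mirroring what a Hadoop Mapper and Reducer do.

from collections import defaultdict

text = "How much wood could a woodchuck chuck if a woodchuck could chuck wood"

# Map: emit a (key, value) pair for every word.
mapped = [(word.lower(), 1) for word in text.split()]

# Shuffle/Reduce: group pairs by key and sum the values per key.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))
# {'how': 1, 'much': 1, 'wood': 2, 'could': 2, 'a': 2, 'woodchuck': 2, 'chuck': 2, 'if': 1}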
Pig and Hive
• Pig
- Platform used to write MapReduce programs
- Uses the Pig Latin language
• Hive
- Summarizes and queries data
- Analyzes data
- Uses the HiveQL language
Additional Components
• HBase – NoSQL database
• Storm – process streaming data
• Spark – in-memory processing
• Giraph – graph processing
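As a hedged illustration of Spark’s in-memory processing, here is the same word count expressed with PySpark; it assumes a local PySpark installation and an input file named input.txt, both of which are assumptions rather than anything from the slides.

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# Each transformation builds an in-memory RDD; nothing runs until collect().
counts = (sc.textFile("input.txt")              # assumed local input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word.lower(), 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.collect())
sc.stop()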
Hadoop is Open Source
• Developed by engineers at Yahoo
• Now an open-source project
- You may hear about Apache Hadoop, Apache Pig
• Hadoop is free
• Anyone can download or modify it
What are large companies and startups using
Big Data for?
General Uses
• Monitoring production lines
• Smart meters for utilities
• Environmental Monitoring
• Infrastructure Management (bridges, railways)
• Supply chain network
• Predictive maintenance
• Energy Management
• Medical and Health Care systems
• Home Automation
Common Applications of Big Data
• For Consumers
- Siri and Yelp
- Spotify and Amazon
- LinkedIn
- Netflix
- Google Now
• GPS aware
• Provides proactive advice on traffic and meals
Common Applications of Big Data
• For Business
- Google Ad searches
- John Deere
- Boeing
- American Airlines
- Orbitz
- Fraud Detection
• The way you move your mouse and navigate a website is distinctive
Common Applications of Big Data
• Google flu trends
- Anybody know how it is generated?
- Based on Google search history rather than lab results
• This approach was found to be more accurate
General Uses of Big Data
Monitoring and Anomaly Detection
• Monitoring detects specific events
• Anomaly Detection detects unexpected events
- Unusual Activity
- Could be based on a combination of criteria
- Usually requires human attention
- Invites inspection
- Hey, I’m not sure this is an issue but it isn’t common
• Big Data allows for more detail
- Extremely rare events
- Combination of a large number of factors
• Measure 1,000 different factors at once
• SPAM – Big Data Collection - GMAIL
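A minimal sketch of the “invites inspection” idea above: flag a new sensor reading that sits far from a baseline of recent readings. The numbers and the three-standard-deviation rule are illustrative assumptions, not a production threshold.

import statistics

# Invented baseline of recent sensor readings and one new reading to check.
baseline = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]
new_reading = 35.7

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

# Flag anything more than three standard deviations from the baseline mean.
if abs(new_reading - mean) > 3 * stdev:
    print(f"anomaly candidate: {new_reading} (baseline mean {mean:.1f})")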
Data Mining and Text Analytics
• Search for unexpected patterns
- Supermarket/stock investments/set of symptoms
• Text Mining
- Focuses on content
- Sentiment analysis
• Positive/negative
• Works best with very large data sets
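A toy sketch of the lexicon-based sentiment scoring described above; the word lists and tweets are invented, and real systems use far larger lexicons or trained models.

# Invented mini-lexicon and sample tweets, for illustration only.
positive = {"love", "great", "fast"}
negative = {"hate", "slow", "broken"}

tweets = ["love the new app, so fast", "the update is slow and broken"]

for tweet in tweets:
    words = set(tweet.lower().split())
    score = len(words & positive) - len(words & negative)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(label, "->", tweet)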
Predictive Analytics
• Crystal Ball of Big Data
• Nate Silver
- Accurately predicted every state in the 2012 election
- Combined election polls and weighted them by reliability
- ESPN bought his website
• Go visit for March Madness
• Netflix – prize for a 10% improvement
- Ensemble Model
- Go to kaggle.com
• Offer compensation for people to create predictive models for them
• Free data to teach predictive modeling
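A hedged sketch of the “weight polls by reliability” idea mentioned above: combine several estimates into one by weighting each according to how much it is trusted. The poll numbers and weights are invented; this is not Nate Silver’s actual model.

# Invented polls: (estimated vote share, reliability weight between 0 and 1).
polls = [(0.52, 0.9), (0.48, 0.4), (0.55, 0.7)]

total_weight = sum(weight for _, weight in polls)
weighted_average = sum(share * weight for share, weight in polls) / total_weight

print(f"combined estimate: {weighted_average:.3f}")  # roughly 0.52 with these numbers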
Visualization
• Computers spot certain patterns
• Computers excel at predictive models
• Computers excel at data mining
• Humans perceive and interpret better
• Human vision plays an important role in Big Data
What Humans do well
• Identifying visual patterns
• Identifying anomalies
• Seeing patterns across groups
• Interpreting content of images
Gestalt Patterns
How to create a Big Data strategy and what
people and skills will you need for Big Data?
Data Scientists
Data Scientists
• Data Scientists need to have all three competencies
- Coding
- Statistics
- Domain Knowledge
Domain Knowledge
Coding
• Competencies to combine a variety of data to determine patterns and
trends
Types and Skills in Data Science
• “Analyzing the analyzers”
- 40 page book
- Studied 250 data scientists
Types and Skills in Data Science
Types and Skills in Data Science
What should be your strategy?
• Are you Conservative/Moderate/Aggressive ?
• Factors
- Competitors
- Is Industry Technology focused?
- Availability of data
- Data expertise
Big Data Strategy
• Build awareness/competencies
• Low cost of entry
- Open Source
- Cloud based hosting
- Unlike expensive Analytics, this is available to everyone
• Create Big Data Targets - pain point for efficiency or improvement
- Which business process needs better decision making
- Which business process needs faster decision making
- Is someone likely to employ Big Data? If so, where?
- Are we processing large amounts of data in a way that could be improved?
- Could we create a new/enhanced data driven product or service?
Big Data Strategy
• For the potential Big Data targets, is there additional data surrounding
the target that would allow for better decision making?
- Can we acquire that data and incorporate it into our analysis?
- How can we combine types of data that have not been combined before to
improve our analysis?
• Refunds/length of time person spent in store originally?
• Refunds/Salesperson?
• Experiment with a solution and iterate
• Always start with a business problem that could have a Big Data
solution
- Too many refunds or losing clients
- Big Data is not a solution unto itself
- Learn from the Data Warehouse projects
Questions?
