2. Who Am I?
• Database Administrator
- Oracle
- SQL Server 6, 6.5, 7, 2000, 2005, 2008, 2012
- Informix
- ADABAS
• Data Modeler/Architect
- Investors Group, LPL Financial, Manitoba Blue Cross, Assante
Financial, CI Funds, Mackenzie Financial
- Normalized and Dimensional
• Agilist
- Innovation Gamer, Team Member, SQL Developer, Test writer, Sticky
Sticker, Project Manager, PMO on SAP Implementation
8. Myths
• Big Data Myth #1: It’s Big
• Big Data Myth #2: You need to apply it right away
• Big Data Myth #3: The more granular the data, the better
• Big Data Myth #4: Big Data is good data
• Big Data Myth #5: Big Data means that analysts become all-important
• Big Data Myth #6: Big Data gives you concrete answers
• Big Data Myth #7: Big Data is a magic 8-ball
• Big Data Myth #8: Big Data can create self-learning algorithms
9. Evolution
Term                            Time Frame
Decision Support                1970-1985
Executive Support               1980-1990
Online Analytical Processing    1990-2000
Business Intelligence           1989-2005
Analytics                       2005-2010
Big Data                        2010-2015
Next??
10. Facts
• Every 2 days we create as much information as we did from the
beginning of time until 2003
• Over 90% of all the data in the world was created in the past 2 years
• It is expected that by 2020 the amount of digital information in
existence will have grown from 3.2 zettabytes today to 40 zettabytes
• The total amount of data being captured and stored by industry
doubles every 1.2 years
• Every minute we send 204 million emails, generate 1.8 million
Facebook likes, send 278 thousand Tweets, and upload 200,000
photos to Facebook
11. Facts
• Google alone processes on average over 40 thousand search
queries per second, making it over 3.5 billion in a single day
• Around 100 hours of video are uploaded to YouTube every
minute and it would take you around 15 years to watch every
video uploaded by users in one day
• If you burned all of the data created in just one day onto DVDs,
you could stack them on top of each other and reach the moon –
twice
• AT&T is thought to hold the world’s largest volume of data in one
unique database – its phone records database is 312 terabytes in
size, and contains almost 2 trillion rows
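As a quick sanity check on the figures above, the per-day numbers can be recomputed from the per-second and per-minute rates. This is only a sketch: the constants are the slide's own figures and the variable names are ours.

```python
# Sanity check of two figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

# Google: about 40 thousand search queries per second (slide figure).
queries_per_second = 40_000
queries_per_day = queries_per_second * SECONDS_PER_DAY

# YouTube: about 100 hours of video uploaded per minute (slide figure).
hours_uploaded_per_minute = 100
hours_uploaded_per_day = hours_uploaded_per_minute * MINUTES_PER_DAY

# Watching around the clock, how many years would one day of uploads take?
years_to_watch = hours_uploaded_per_day / 24 / 365

print(f"{queries_per_day:,} queries per day")
print(f"{years_to_watch:.1f} years to watch one day of uploads")
```

The results come out at roughly 3.5 billion queries a day and about 16 years of viewing, in line with the rounded figures on the slide.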
12. Facts
• 570 new websites spring into existence every minute of every
day
• 1.9 million IT jobs will be created in the US by 2015 to carry out
big data projects. Each of those will be supported by 3 new jobs
created outside of IT – meaning a total of 6 million new jobs
thanks to big data
• Today’s data centres occupy an area of land equal in size to
almost 6,000 football fields
• Between them, companies monitoring Twitter to measure
“sentiment” analyze 12 terabytes of tweets every day
13. Facts
• The amount of data transferred over mobile networks increased
by 81% to 1.5 exabytes (1.5 billion gigabytes) per month between
2012 and 2014. Video accounts for 53% of that total
• The NSA is thought to analyze 1.6% of all global internet traffic –
around 30 petabytes (30 million gigabytes) every day
• The value of the Hadoop market is expected to soar from $2
billion in 2013 to $50 billion by 2020, according to market
research firm Allied Market Research
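The unit conversions and the growth implied by the Hadoop forecast can be checked with a short calculation. This is a sketch assuming decimal (SI) units and steady compound growth between 2013 and 2020; neither assumption is stated on the slide.

```python
# Decimal (SI) storage units, assumed here for the conversions.
GIGABYTE = 10**9
PETABYTE = 10**15
EXABYTE = 10**18

# Mobile traffic: 1.5 exabytes per month expressed in gigabytes.
mobile_gb_per_month = 1.5 * EXABYTE / GIGABYTE

# NSA estimate: 30 petabytes per day expressed in gigabytes.
nsa_gb_per_day = 30 * PETABYTE / GIGABYTE

# Hadoop market forecast: $2 billion (2013) to $50 billion (2020).
start_value, end_value = 2e9, 50e9
years = 2020 - 2013
# Compound annual growth rate implied by the two endpoints.
implied_annual_growth = (end_value / start_value) ** (1 / years) - 1

print(f"{mobile_gb_per_month:,.0f} GB per month")
print(f"{nsa_gb_per_day:,.0f} GB per day")
print(f"{implied_annual_growth:.1%} implied growth per year")
```

Under these assumptions the forecast implies growth of a little under 60% per year.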
14. Facts
• The number of bits of information stored in the digital universe is
thought to have exceeded the number of stars in the physical
universe in 2007
• The boom of the Internet of Things will mean that the number of
devices connected to the Internet will rise from about 13 billion
today to 50 billion by 2020
• 12 million RFID tags – used to capture data and track movement
of objects in the physical world – had been sold by 2011. By
2021, it is estimated that number will have risen to 209 billion as
the Internet of Things takes off
15. Facts
• Big data has been used to predict crimes before they happen – a
“predictive policing” trial in California was able to identify areas
where crime will occur three times more accurately than existing
methods of forecasting
• By better integrating big data analytics into healthcare, the
industry could save $300 billion a year – that’s the equivalent of
reducing the healthcare costs of every man, woman and child by
$1,000 a year
• Retailers could increase their profit margins by more than 60%
through the full exploitation of big data analytics
16. Facts
• The big data industry is expected to grow from US$10.2 billion in
2013 to about US$54.3 billion by 2017
18. What Big Data is and what it isn’t. What
problems does it solve?
19. What is Big Data not?
• Not regular data
• Not data that fits into the existing analytic paradigm/toolset
• Doesn’t easily fit into the existing row/column structures
20. New Types of Data
• Activity Data
- Listening to music
- Watching movies
- Browsing
- Driving
- Walking
- Exercising
• Intentional and Non-Intentional
21. New Types of Data
• Conversations
- Audio
- Visual
- Textual
23. New Types of Data
• Image data
- Photo
• Cameras
• Phones
- Video
• Cameras
• CCTV
• YouTube!
24. New Types of Data
• Machine to Machine
- Cell phone to towers
- Medical devices
• Sensors
- Location
- Speed
- Acceleration
- Health
- Altitude
- Temperature
- Humidity
- Among many others….
25. New Types of Data
• Internet of Things
- Toasters
- Fridges
- Phones
- Jet Engines
- Combines
- Manufacturing
26. Abandoned Activities
• Exciting new area of big data
- Which mouse movements and keystrokes are started but eventually abandoned?
- Which articles do you almost comment on?
- What transactions do you start but not complete?
- Do these share common themes?
- At what step is the transaction abandoned?
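The question of where a transaction is abandoned can be illustrated with a small funnel count. This is a minimal sketch: the funnel steps, the session data, and the function name `abandonment_by_step` are all hypothetical.

```python
from collections import Counter

# Hypothetical checkout funnel; the step names are illustrative only.
FUNNEL = ["view_cart", "enter_address", "enter_payment", "confirm"]


def abandonment_by_step(sessions):
    """Count, for each funnel step, how many sessions stopped there."""
    counts = Counter()
    for steps in sessions:
        # A session is abandoned if it never reached the final step.
        if steps and steps[-1] != FUNNEL[-1]:
            counts[steps[-1]] += 1
    return counts


# Made-up sessions: each list is the ordered steps one visitor reached.
sessions = [
    ["view_cart", "enter_address", "enter_payment", "confirm"],
    ["view_cart", "enter_address"],
    ["view_cart"],
    ["view_cart", "enter_address"],
]

abandoned = abandonment_by_step(sessions)
print(abandoned.most_common())
```

In this made-up data the address step loses the most visitors, which is the kind of shared theme the slide asks about.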
27. What makes Data, Big Data?
• Volume
• Velocity
• Variety
• Veracity
28. Volume
• If we take all the data generated from the beginning of time to 2008
- That same amount of data is now generated almost every minute
• We can now store that data across immense networks
• But there is more data than we can possibly analyze
• One airplane generates 2 terabytes of data about its engines on a
flight across North America
• In 2012, half of the data available for analysis was surveillance video
29. Big Data versus Lotsa Data
• Lotsa Data
- Same structured data as you currently have
• Just more of it
- Can be analyzed using the same paradigms/toolsets
• May just take a bit longer
30. Velocity
• The speed at which data is generated, analyzed, and
moved around
- Think of how quickly something can be trending on Twitter
• Technology now allows us to analyze data in memory without it ever
being stored
• 200 billion tweets per year
• Many earlier analytics techniques assume static data
- But what happens to your analysis if the data is constantly changing?
• Sensor data
- Can’t store all of it, just too much data
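Analyzing data in memory without storing it can be sketched with a running summary that is updated one reading at a time. The example below uses Welford's online algorithm for mean and variance, which is our choice of technique and not one named on the slide; the class name and the sensor readings are hypothetical.

```python
class RunningStats:
    """Keeps a running mean and variance without storing the readings."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Welford's online update: one pass, constant memory.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
# Made-up sensor readings arriving one at a time; none are kept.
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

Only three numbers are held in memory no matter how many readings arrive, which is what makes this style of analysis workable for sensor streams.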
31. Variety
• We used to be able to just focus on structured data in neat relational
table structures
• 80% of the data is now unstructured
- Text
- Image
- Video
- Voice
- Social Graph data
- NoSQL
• Big Data Technology can now bring different types of data together to
analyze
32. Variety
• Structured
• Unstructured
• Semi-structured
- XML
- JSON
- Usually applied to text to try and enhance the analysis that can be done
- Stored in NoSQL databases
• MongoDB
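Semi-structured data can be illustrated with two JSON records that describe similar things but do not share the same fields. This is a sketch using Python's standard `json` module; the records themselves are made up.

```python
import json

# Two made-up records: both have "user" and "text", but only one has
# "tags" and only the other has a nested "geo" object.
raw = """
[
  {"user": "a", "text": "hello", "tags": ["x"]},
  {"user": "b", "text": "hi", "geo": {"lat": 49.9, "lon": -97.1}}
]
"""

records = json.loads(raw)

# Collect every field name that appears in at least one record.
all_fields = sorted({key for record in records for key in record})

# Fields present in every record (what a fixed row/column table would need).
common_fields = sorted(set.intersection(*(set(r) for r in records)))

print(all_fields)
print(common_fields)
```

A fixed row/column table would need a column for every field in `all_fields`, most of them empty, which is one reason such data tends to be kept in document stores.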
33. Variety
• “Variety is the biggest factor leading companies to Big Data” – Gartner
• Do you need all three factors to make it big data?
- No
- Any two of these factors that cause the existing analysis to fall short can make
the data Big Data
• As long as one is variety
34. Veracity
• 4th V
- Not mentioned consistently
• Definition
- Messiness or trustworthiness of data
- Involves the lineage of the data
- Twitter versus corporate data
- Is there insightful value in the data?
• 5th V - Validity
36. What Problems does it solve?
• Risk Modelling
- Banks and Insurance
- Credit Card activity
• Customer churn analysis
- What were people doing just before they left?
• Recommendation engines
- LinkedIn, Amazon, Netflix
• Ad targeting
• Aggregated Transactional Analysis
- Supply Chain Management
• Threat Analysis
37. Who is leading the way?
• LinkedIn
- People you may know
• Netflix
- Million dollar prize
• GE/John Deere
- Predictive servicing on industrial devices
• Amazon
- Books others have also bought
• Google
- For everything
38. Who is leading the way?
• UPS
• United HealthCare
• Macys
• Bank of America
• Citigroup
• Verizon Wireless
• City of Brandon
- Snowclearing website
39. What can be achieved?
• Cost reductions
- Large Enterprises
• Time reductions
- Large Enterprises and Small Enterprises
• Better decisions
- Large Enterprises, Small Enterprises, and Startups
• New Offerings/Product Innovations
- Large Enterprises, Small Enterprises, and Startups
40. Why is it important?
• Big Data is here to stay
• If you don’t embrace it, your competition will. They will use it to:
- Deliver faster and at lower cost
- Develop new innovations
- Understand customers better to attract them and keep them from leaving
- Make quicker and better decisions
42. Distributed Data
• Big Data is just too big to be stored on one computer
• Storing data across multiple computers allows you to take advantage
of the other computers’ processing power
• SANs were the first solution to distributed data
• Cloud data is now the second generation of the solution
- Amazon S3 – Netflix
- Amazon Glacier
45. CAP Theorem
• Consistency
• Availability
• Partition tolerance
• Transactional Data not a good fit for Hadoop
- Requires Consistency
• Behavioural Data is a good fit
- Health Care
- Social Media
46. Hadoop
• Hadoop was the name of the stuffed toy elephant that belonged to the
son of one of the developers
• Not a single product
• Collection of applications
• Framework or platform
• Several modules
• Not a database
- Alternative file system with a processing library
47. Hadoop
• Why use Hadoop?
- Cheaper
- Faster
- Better suited to unstructured data
48. HDFS
• Hadoop Distributed File System, spread across many
computers – hundreds to tens of thousands
• Not a database
- These are individual files stored across the computers
• Based on Google’s GFS (Google File System)
- To index the Internet
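How a single file ends up spread across many computers can be sketched as splitting it into fixed-size blocks and keeping several copies of each block on different nodes. HDFS does store files as replicated blocks, but the round-robin placement below is a simplifying assumption and not HDFS's actual placement policy; the function name, node names, and sizes are hypothetical.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Assign each block of a file to `replication` distinct nodes.

    Round-robin placement is an illustrative assumption only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # Ceiling division: a partial final block still needs a slot.
    num_blocks = -(-file_size // block_size)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# A 300 MB file with 128 MB blocks on a made-up four-node cluster.
layout = place_blocks(300, 128, ["node0", "node1", "node2", "node3"])
for block, holders in layout.items():
    print(block, holders)
```

Because every block lives on more than one node, losing a single computer does not lose the file, and each node can process the blocks it already holds.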
49. MapReduce
• Map splits a task into many pieces
- Split a task up and send it to many computers
• Reduce takes the results and combines them back together
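• In Hadoop 2, cluster resource management moved to YARN
- This generation is sometimes called MapReduce 2 (MRv2)
• Important feature of YARN
- Lets engines other than MapReduce share the cluster, so stream processing is possible in addition to batch processing
- Graph processing engines can run on it as well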
50. MapReduce
• Programming paradigm
- Map component executes a function on each piece of data
• Execute on each node
• Bring the compute to the data
• Output key and value pairs on each node
- Reducer Aggregates the key value pairs on the nodes
• Outputs a combined list
- Mapper and Reducer are classes
• Really a Functional programming model though
• State is not shared
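The paradigm above can be sketched in plain Python, with lists standing in for the data held on each node. This is a single-process illustration, not Hadoop's API: `run_map_reduce`, the sensor data, and the mapper and reducer functions are all hypothetical.

```python
from collections import defaultdict


def run_map_reduce(partitions, mapper, reducer):
    """Simulate map, shuffle, and reduce over per-node partitions."""
    # Map: each partition is processed on its own, sharing no state.
    mapped = [
        [pair for record in partition for pair in mapper(record)]
        for partition in partitions
    ]
    # Shuffle: group all values that share a key, across partitions.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    # Reduce: combine each key's values into a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}


def emit_reading(record):
    """Mapper: turn one (sensor, temperature) record into a key/value pair."""
    sensor, temperature = record
    yield (sensor, temperature)


def hottest(sensor, temperatures):
    """Reducer: keep the highest temperature seen for a sensor."""
    return max(temperatures)


# Made-up readings, split across two "nodes".
partitions = [
    [("s1", 20.5), ("s2", 18.0)],
    [("s1", 22.1), ("s2", 17.2)],
]
result = run_map_reduce(partitions, emit_reading, hottest)
print(result)
```

The mapper and reducer are pure functions with no shared state, which is what lets a real cluster run them on many machines at once.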
51. Hello World for MapReduce
• Wordcount
- How much wood could a woodchuck chuck if a woodchuck could chuck
wood?
- {how,1;much,1;wood,2;could,2;a,2;woodchuck,2;chuck,2;if,1}
• We are essentially labelling and counting components of data
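The woodchuck example can be reproduced with a map step that labels each word with a 1 and a reduce step that adds up the labels. This is a plain-Python sketch rather than Hadoop code; the function names are ours.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map: emit (word, 1) for every word in the line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield (word, 1)


def reduce_counts(pairs):
    """Reduce: sum the 1s emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


sentence = "How much wood could a woodchuck chuck if a woodchuck could chuck wood?"
counts = reduce_counts(map_words(sentence))
print(counts)
```

The counts match the list on the slide: "how", "much", and "if" appear once, and every other word appears twice.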
52. Pig and Hive
• Pig
- Platform used to write MapReduce programs
- Uses the Pig Latin language
• Hive
- Summarizes, queries, and analyzes data
- Uses the HiveQL language
54. Hadoop is Open Source
• Developed by engineers at Yahoo
• Now an open-source project
- You may hear about Apache Hadoop, Apache Pig
• Hadoop is free
• Anyone can download or modify
55. What are large companies and startups using
Big Data for?
56. General Uses
• Monitoring production lines
• Smart meters for utilities
• Environmental Monitoring
• Infrastructure Management (bridges, railways)
• Supply chain network
• Predictive maintenance
• Energy Management
• Medical and Health Care systems
• Home Automation
57. Common Applications of Big Data
• For Consumers
- Siri and Yelp
- Spotify and Amazon
- LinkedIn
- Netflix
- Google Now
• GPS aware
• Provides proactive advice on traffic and meals
58. Common Applications of Big Data
• For Business
- Google Ad searches
- John Deere
- Boeing
- American Airlines
- Orbitz
- Fraud Detection
• The way you move your mouse and navigate a website is distinctive
59. Common Applications of Big Data
• Google flu trends
- Anybody know how it is generated?
- Based on Google search history rather than lab results
• Early results suggested it tracked flu activity faster than lab reporting; later studies found it overestimated cases
61. Monitoring and Anomaly Detection
• Monitoring detects specific events
• Anomaly Detection detects unexpected events
- Unusual Activity
- Could be on a combination of criteria
- Usually requires human attention
- Invites inspection
- Hey, I’m not sure this is an issue but it isn’t common
• Big Data allows for more detail
- Extremely rare events
- Combination of a large number of factors
• Measure 1,000 different factors at once
• Spam filtering – Big Data collection – Gmail
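Flagging unusual activity for a human to inspect can be sketched with a simple z-score rule: anything far from the average, measured in standard deviations, is reported. The threshold of 3 and the sample values are assumptions for illustration; real systems combine many factors, as the slide notes.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold`
    standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mu) / sigma > threshold
    ]


# Made-up measurements: twenty ordinary readings and one outlier.
readings = [10] * 20 + [100]
flagged = find_anomalies(readings)
print(flagged)
```

The rule only invites inspection. Whether the flagged reading is a real problem is still a human call.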
62. Data Mining and Text Analytics
• Search for unexpected patterns
- Supermarket/stock investments/set of symptoms
• Text Mining
- Focuses on content
- Sentiment analysis
• Positive/negative
• Works best with very large data sets
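Positive/negative sentiment analysis can be sketched with a tiny word list: count the positive words, subtract the negative ones, and look at the sign. The word lists and the function name are hypothetical, and production systems use far richer models than this.

```python
import re

# Tiny illustrative word lists; real lexicons hold thousands of entries.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "angry"}


def sentiment(text):
    """Label text as positive, negative, or neutral by counting words."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in ["I love this great product", "Terrible service, bad food", "The sky"]:
    print(tweet, "->", sentiment(tweet))
```

A rule this crude misreads sarcasm and negation, which is one reason the approach works best when errors can average out over very large data sets.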
63. Predictive Analytics
• Crystal Ball of Big Data
• Nate Silver
- Accurately predicted every state in the 2012 election
- Combined election polls and weighted them by reliability
- ESPN bought his website
• Go visit for March Madness
• Netflix – prize for a 10% improvement in its recommendations
- Ensemble Model
- Go to kaggle.com
• Offer compensation for people to create predictive models for them
• Free data to teach predictive modeling
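Combining polls and weighting them by reliability comes down to a weighted average. The sketch below shows only that idea: the poll numbers and reliability weights are made up, and this is not a description of any forecaster's actual model.

```python
def weighted_average(polls):
    """Average poll results, counting reliable polls more heavily.

    `polls` is a list of (result, weight) pairs with positive weights.
    """
    total_weight = sum(weight for _, weight in polls)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(result * weight for result, weight in polls) / total_weight


# Made-up polls: (candidate's share in percent, reliability weight).
polls = [(52.0, 3.0), (48.0, 1.0)]
estimate = weighted_average(polls)
print(estimate)
```

A plain average of these two polls would give 50.0. Weighting the more reliable poll three times as heavily moves the estimate to 51.0.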
64. Visualization
• Computers spot certain patterns
• Computers excel at predictive models
• Computers excel at data mining
• Humans perceive and interpret better
• Human vision plays an important role in Big Data
65. What Humans do well
• Identifying visual patterns
• Identifying anomalies
• Seeing patterns across groups
• Interpreting content of images
75. What should be your strategy?
• Are you Conservative/Moderate/Aggressive ?
• Factors
- Competitors
- Is Industry Technology focused?
- Availability of data
- Data expertise
76. Big Data Strategy
• Build awareness/competencies
• Low cost of entry
- Open Source
- Cloud based hosting
- Unlike expensive Analytics, this is available to everyone
• Create Big Data Targets - pain point for efficiency or improvement
- Which business process needs better decision making
- Which business process needs faster decision making
- Is someone likely to employ Big Data? If so, where?
- Are we processing large amounts of data that could be made better?
- Could we create a new/enhanced data driven product or service?
77. Big Data Strategy
• For the potential Big Data targets, is there additional data surrounding
the target that would allow for better decision making?
- Can we acquire that data and incorporate it into our analysis?
- How can we combine types of data that have not been combined before
to improve our analysis?
• Refunds/length of time person spent in store originally?
• Refunds/Salesperson?
• Experiment with a solution and iterate
• Always start with a business problem that could have a Big Data
solution
- Too many refunds or losing clients
- Big Data is not a solution unto itself
- Learn from the Data Warehouse projects