Big Data:

Your New Best Friend
Reuven M. Lerner, PhD
MegaComm 2016 • February 18th, 2016
1 Big Data.key - February 18, 2016
Who am I?
• Long-time programmer, consultant, trainer
• Python, Git, PostgreSQL, Ruby
• Linux Journal columnist
My stuff
• Newsletter: http://lerner.co.il/newsletter
• Blog: http://blog.lerner.co.il/
• Daily Tech Video: http://dailytechvideo.com/
• Or @DailyTechVideo on Twitter
• Mandarin Weekly: http://MandarinWeekly.com
• Or @MandarinWeekly on Twitter
Elections!
• Israel had elections last year
• The United States has elections this year
• Rumor has it, the world contains some other
countries, many of which also hold elections
Polls
• Before an election, politicians, reporters, and political
junkies (like me) look at the polls.
• We want to know who is ahead, and who is behind
• The politicians want to know which groups like (and
dislike) them, so that they can focus their rhetoric
and campaigning
Are the polls always right?
Polls are statistical models
• Polls use math to predict the likelihood of a
particular outcome, based on a number of inputs
• Models are toy versions of reality
• They allow us to explore and understand reality,
and should bear some connection to it
• But there will always be a distinction between a
model and the real world
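One way to see the gap between a poll and reality is the margin of error. The sketch below uses the standard approximation for a proportion and assumes a simple random sample; real polls add weighting and other adjustments. The function name and the numbers are made up for illustration.

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p
    measured on a simple random sample of size n."""
    return z * sqrt(p * (1 - p) / n)

# Hypothetical poll: 1,000 respondents, 50% support for a candidate
moe = margin_of_error(0.5, 1000)
print(f"{moe:.1%}")
```

With 1,000 respondents, the uncertainty is roughly three percentage points in either direction, which is often larger than the gap between the candidates.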
Models are important!
• They allow us to explore and understand the world
• They enable us to make predictions
• They reduce costs, and allow us to do things that
are otherwise impossible or unethical
• 2013 Nobel Prize in Chemistry — for scientists who
engaged in modeling of chemistry
Testing models
• Elections are unusual: You only have one shot at
testing your model to see if it’s accurate
• But in your business, you can create and test
models every day, modifying the number, type, and
weights of the inputs
Big data
• 90% of the data ever created was generated in the
last two years (according to IBM):
• Writing, video, audio
• Travel, e-commerce, electricity use, phone calls
• Metadata, as well
• Maybe people aren’t just numbers… but given how
often we’re quantified, we’re not that far away
But numbers are good
(if you’re a computer)
• Modern computers can hold billions of them
• Store not only information about people, but their
characteristics and traits, as well as dates and
times
Your business
• When you make business decisions, what factors
are you considering?
• Are you trying to check all of the possible
correlations, across all of the data?
• Or are you sampling, and hoping that your sample
is an accurate and representative one?
Enter big data!
• Your business is now collecting lots and lots of data
• Who is buying your products and services?
• How often do they visit your Web site?
• Which of your e-mail messages do they open?
• What do they buy?
• How old are they, and where do they live?
Why “big” data?
• It sounds sophisticated and high-tech.
• There really is a lot of it.
• Often, there’s more than we can fit (or process) on
a single computer
• But often, it’s not really that big
Enter data science
• Data scientists come up with ways to turn raw data
into useful information
• They create and use models to find correlations
among the many pieces of data you’re collecting
• They can help you use these correlations to
improve your marketing, sales, and production
What is a data scientist?
A person employed to analyze and interpret complex
digital data, such as the usage statistics of a website,
especially in order to assist a business in its decision-
making.
— Oxford Dictionary
More realistically…
Data scientist (noun): Person who is better at
statistics than any software engineer and better at
software engineering than any statistician.
— Josh Wills
Graphically…
From Drew Conway, 2010
Look for correlations
• Data scientists look for correlations
• Using those correlations, we know where we have
been successful (and not)
• These can be interesting, useful, or crucial
• Being able to analyze lots of factors, and thus find
correlations in them, allows our models to be more
sophisticated — and also predictive
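A minimal sketch of what "looking for a correlation" can mean in practice: computing the Pearson correlation coefficient between two columns of numbers. The data (weekly ad spend and weekly sales) and the function name `pearson` are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two series move together
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each series varies on its own
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: weekly ad spend vs. weekly sales
ad_spend = [100, 200, 300, 400, 500]
sales = [12, 19, 33, 38, 52]
r = pearson(ad_spend, sales)
print(round(r, 3))
```

A value near 1 or -1 signals a strong linear relationship. By itself, it does not say that one thing causes the other.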
Spurious correlations
• http://tylervigen.com/spurious-correlations
Data scientists’ tools
• Programming languages + libraries
• Data sets
• Machine learning
• Distributed processing systems
Programming languages
• R
• Python
• Julia
• Clojure
Data sets
• Your own
• Public ones
What do data sets look like?
• Excel spreadsheets
• CSV files
• Multiple CSV files (e.g., separated by date)
• Databases you can clone — but this is rare
Cleaning the data
• Remove bad, incomplete data
• Remove data that isn’t relevant for the investigation
you’re doing
• But don’t remove too much, ruining your data!
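The cleaning steps above can be sketched with Python's standard `csv` module. The column names, the sample rows, and the rule for a "plausible" age are all assumptions chosen for illustration; your own data will need its own rules.

```python
import csv
import io

# Hypothetical raw data; in practice this would come from a CSV file
raw = """name,age,city
Alice,34,Haifa
Bob,,Tel Aviv
Carol,abc,Jerusalem
Dan,51,Modi'in
"""

def clean_rows(csv_text):
    """Keep only rows whose age is a plausible integer, and drop unused columns."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        age = row["age"].strip()
        if not age.isdigit():           # missing or non-numeric age: bad data
            continue
        if not 0 < int(age) < 120:      # implausible age: bad data
            continue
        # Keep only the columns relevant to this (made-up) investigation
        cleaned.append({"name": row["name"], "age": int(age)})
    return cleaned

rows = clean_rows(raw)
print(rows)
```

Here half of the rows are thrown away, which is a reminder to count what you discard before trusting the result.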
Machine learning
• The computer can learn to categorize things as
well as humans do
• Then, when given new data, it can decide into
which category to put the new item
Spam filters
• Spam filters use a simple form of machine learning
• Is a particular e-mail message spam?
• Check the contents, using a variety of factors
• If the factors make this document similar to other
spam documents, then mark it as spam
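The slides do not say which technique a spam filter uses. One common choice is a naive Bayes classifier over the words in each message, sketched below. The training messages and function names are made up, and a real filter would use far more data and more factors than word counts.

```python
from collections import Counter
from math import log

def train(messages):
    """messages: list of (text, is_spam) pairs.
    Returns per-class word counts and per-class message counts."""
    word_counts = {True: Counter(), False: Counter()}
    msg_counts = Counter()
    for text, spam in messages:
        msg_counts[spam] += 1
        word_counts[spam].update(text.lower().split())
    return word_counts, msg_counts

def is_spam(text, word_counts, msg_counts):
    """Return True if the text looks more like the spam examples than the others."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Start from how common this class is overall
        score = log(msg_counts[label] / sum(msg_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing, so unseen words do not zero out the score
            score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]

# Hypothetical, tiny training set
training = [
    ("win free money now", True),
    ("free prize claim now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
wc, mc = train(training)
print(is_spam("claim your free money", wc, mc))
print(is_spam("agenda for lunch meeting", wc, mc))
```

Marking a message as "not junk" amounts to adding it to the training data with the other label, which is how such a filter improves over time.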
Aha!
• Wondering why e-mail from certain people always
gets put into the “junk” e-mail box?
• Because those people send mail that looks (to the
machine-learning system) too much like junk
• Mark the messages as not being junk, so your
spam-control system can learn over time
Experience is important
• In people, learning is a matter of experience
• Machine learning is all about computers also
gaining that experience
Models
• Machine learning employs many models
• Each model uses different techniques to teach the
computer which categories data should be put into
• Supervised vs. unsupervised learning
• The computer can then be given new data
Example: K nearest
neighbors
• One common machine-learning algorithm finds the
closest k (a number) items to a new piece of data
• We then have an election: to which category do
most of those k neighbors belong?
• Our new data point joins the majority category
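The three steps above can be sketched in a few lines of Python. The data points (visits per month, average purchase size) and the labels are made up for illustration, and `knn_classify` is a hypothetical helper, not a library function.

```python
from collections import Counter
from math import dist

def knn_classify(new_point, labeled_points, k=3):
    """labeled_points: list of ((x, y), label) pairs.
    Return the majority label among the k points nearest to new_point."""
    # Sort existing data by distance to the new point, and keep the closest k
    nearest = sorted(labeled_points, key=lambda item: dist(item[0], new_point))[:k]
    # Hold the "election": the most common label among those neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (visits per month, average purchase), with a label
customers = [
    ((1, 1), "casual"), ((2, 1), "casual"), ((1, 2), "casual"),
    ((8, 9), "loyal"), ((9, 8), "loyal"), ((9, 9), "loyal"),
]
print(knn_classify((2, 2), customers))
print(knn_classify((8, 8), customers))
```

Choosing an odd k avoids ties when there are only two categories.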
Lots of other models
• Linear regression
• Logistic regression
• Neural networks
• Deep learning
• K-means clustering
• And many, many others — with lots in active
development!
Data science use cases
• So, where is data science being used?
• And how can we apply it to our businesses?
A/B testing
• Find out what your users respond to
• Try two (or more) different versions of your Web site
• Compare to see which one has greater conversions
(i.e., e-commerce success)
• Use the better one… and then do another
experiment, ad infinitum
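Comparing two versions means comparing conversion rates, and checking that the difference is bigger than chance would explain. The slides do not name a test; the sketch below assumes a two-proportion z-test, one common choice. The visitor and conversion numbers are made up.

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference
    between the conversion rates of versions A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate: what we would expect if A and B were really the same
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical results: 5,000 visitors saw each version of the site
z, p = two_proportion_z(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"A: {200 / 5000:.1%}  B: {260 / 5000:.1%}  z={z:.2f}  p={p:.4f}")
```

A small p-value suggests that version B's higher rate is unlikely to be luck, so B becomes the baseline for the next experiment.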
Correlations!
• Amazon is one of the most successful data-science
shops
• They’re always collecting information on what
people look at and buy — and they suggest other
products based on that behavior
• How often are they right? (Very often, actually)
Fraud detection
• What behavior is correlated with a stolen credit
card?
• What language is correlated with a research paper
that was already written and submitted?
Interact with data
• Visualizations provide us (humans) with insights
• Many data scientists spend their time helping
others create powerful, useful visualizations
• GIS (geographic information systems) allow us to
take data, and put it on maps. Some maps are
even interactive, letting us explore data in new
ways
Add GIS, and create maps
• https://openaccess9000.cartodb.com/viz/
3459b348-8212-11e5-b022-0e8c56e2ffdb/
public_map
This fire hydrant
might earn more than you
From I Quant NY
“Half the money I spend on advertising is wasted; the
trouble is I don't know which half.”
— John Wanamaker
Advertising
• We can show ads online, and know who has
clicked on them.
• But we can do better: Show ads to the people for
whom they’re most relevant, and most likely to be
appropriate
• How can we do that?
Some ideas
• Show people ads based on text searches
• Show people ads based on what they have
explicitly told us
• Show people ads based on what content they have
indicated they like
• Show people ads based on their friends’
preferences and demographics
Aha!
• No wonder Google and Facebook are pioneers in
the area of big data
• They’re using enormous amounts of data to display
ads that people like
• And they get lots of additional data points every
day, thanks to searches and “likes”
Data sets
• UCI’s machine learning data set
• https://archive.ics.uci.edu/ml/datasets/Housing
• Newsletter with new data sets:
• http://tinyletter.com/data-is-plural/
Really big data
• What do we do when the data is too big?
• What if it will take too long to process, or the data is
too big to store on a single machine?
• Then we call in the truly big guns — distributed
processing systems
Map-reduce
• map-reduce has been around for decades on
individual computers
• But only now (thanks to Google’s implementation for
distributed systems) does everyone want to use it
• map: apply a function to every element of a sequence
• reduce: turn a sequence of values into a single (or small)
value
• Not all data can be broken apart easily!
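On a single computer, map and reduce look like this in Python, using the built-in `map` and `functools.reduce`. The word list is made up for illustration.

```python
from functools import reduce

words = ["big", "data", "is", "your", "new", "best", "friend"]

# map: apply a function (len) to every element of the sequence
lengths = list(map(len, words))

# reduce: turn the sequence of lengths into a single value (their sum)
total = reduce(lambda a, b: a + b, lengths, 0)

print(lengths)
print(total)
```

The map step treats each element independently, which is what makes it possible to hand different parts of the sequence to different machines.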
• Create a Hadoop cluster, including storage of the
data you want to understand there
• Run a map-reduce query on your data — apply a
function to it (e.g., do you contain the phrase
“machine learning”) and then reduce into an HTML
page
• Use virtual machines in the cloud to make your
cluster bigger or smaller, as necessary
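The query described above can be simulated in plain Python to show its shape. This is not Hadoop's API: the "nodes" are just lists, and the documents and function names are made up. In a real cluster, each node would process the part of the data stored on it.

```python
from functools import reduce

PHRASE = "machine learning"

def mapper(document):
    """Map step: 1 if the document contains the phrase, otherwise 0."""
    return 1 if PHRASE in document.lower() else 0

def count_matches(documents):
    """Map and reduce over one node's share of the documents."""
    return reduce(lambda a, b: a + b, map(mapper, documents), 0)

# Hypothetical documents, already split across three "nodes"
nodes = [
    ["Machine learning is everywhere", "Polls are models"],
    ["Spam filters use machine learning", "Maps are fun"],
    ["Big data is not always big"],
]

# Each node reduces its own data; the partial results are then combined
partial_counts = [count_matches(docs) for docs in nodes]
total = sum(partial_counts)

# Final step: turn the result into a small HTML page
html = f"<html><body><p>{total} documents mention '{PHRASE}'</p></body></html>"
print(html)
```

Adding nodes shrinks each node's share of the work, which is why growing or shrinking the cluster changes how long the query takes.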
• More modern, real-time, in-memory analysis system
• Open-source system that’s increasingly popular
• Built on the same filesystem as Hadoop
• Connections from Java, Python, R
• Has a suite of highly parallel machine-learning models
• Because your data is in memory (and split across
multiple virtual machines), it runs much faster
How old are you?
• http://how-old.net
Thanks!
Any questions?
• You can always find me at:
• reuven@lerner.co.il
• http://www.lerner.co.il/
• http://blog.lerner.co.il/
• http://lerner.co.il/newsletter
• @reuvenmlerner on Twitter
