LET'S GET REAL
THE REAL IMPACT OF BIG DATA
Claudiu Popa
@datarisk
THE DEFINITION
• Short for big data analytics
• The trend towards larger data sets allowing correlations to be found to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions
THE BUZZWORD
• Marketing term increasingly used to evoke computational power and large-scale algorithms that can turn mountains of data into usable information and intelligent business decisions
• Modern-day version of Big Brother: diverse transactional data that can impact privacy
MENU
• PROMISE
• REALITY
• OPPORTUNITY
• CHALLENGE
• SOLUTION
The Promise of Big Data
Sexiest Job of the 21st Century?
• Predicting crime / fraud detection
• Timing investments
• Mining astro data for E.T.
• Productivity loss of stressful travel
• Call centre analytics
• Dynamic ticket pricing
• Guessing demand for better service
• E-com & customer service
• Capture recurring revenue
• Using sensor data for efficiency
• Barack Obama's campaign
• Medical research/treatment
• Reduced pushback campaigns
• Crowdfund / crowdsource
• Open pit data mining
• Mapping the sex trade
• Disease eradication
• Tracking endangered species
Big benefits
Source: Fast Company. Photos: Jeff Brown. Illustrations: Justin Mezzell
can't we agree on a definition?
2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
More than Mining: The Reality of Big Data
More fun than mining for Bitcoins
Full graph back to the 1500s saved at http://Popa.ca/GraphInfo
Some get it...
others, not so much.
“Seeking clarity? Given the signal-to-noise ratio, big data itself appears to be telling us that working raw numbers at ever greater scale to force out some answers may not be the real achievement here. Instead, it is perhaps in the elegance with which it steers us towards the right questions to ask.”
big data is often personal
2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
but social media is a big part of the risk
Do custodians have a responsibility?
• Google Person Finder
• Flu Trends
• Dengue Trends
• Crisis Maps
• MOOC metrics
• Sex trade tracking
big data growing big time, and fast
Source #1: www.sourcelink.com/blog/guest-author-series/2012/08/13/if-you-build-it-they-will-come
the 3 dimensions of Big Data
• volume: is there enough data? Too much?
• variety: how diverse is the data?
• velocity: how rapidly must it be processed?
the 6 new aspects of Big Data
• validity: what is the threshold for applicability?
• verifiability: will we continue to trust the input source?
• variability: what's the tolerance for variation?
• veracity: is it useful for predicting business value?
• value: does it offer valuable insight?
• visualization: can it be meaningfully represented?
Big Value
• silos: are departmental silos good or bad?
• timely exploitation: volume and velocity = complexity
• system integration: the real enablers need to 'get' privacy
“Visualization is the most likely V to save big data by making personalized information available to owners and not custodians in natural, frictionless ways.”
Full disclosure: why I care
• integrity: if you can't trust the data, immense effort and expense are lost
• confidentiality: the impact of potential breaches is vastly increased
• availability: input availability impacts value and velocity in particular
• privacy: impacted by any breach of data sets/clusters with PII
Hint: it’s not because Gartner said the space is worth up to $3.8 Trillion
The First 4 Vs: Where's the Privacy?
Source: Data Science Central
too valuable to discard?
Source: http://jeffhurtblog.com/2012/07/20/three-vs-of-big-data-as-applied-conferences/
Visualizing Privacy in Big Data
Are we there yet?
• Utah Data Center will handle yottabytes of data
• TB: 10^12 bytes; exa: 10^18; zetta: 10^21; yotta: 10^24
• Each Boeing jet engine creates 20 TB/hr
• Facebook grows by 500 TB/day
Ref: Visual.ly infographic on big data
Scale
• PB = 10^15: all printed material in existence in 1995, or ONE SECOND of data generated at CERN/LHC
• 1 exabyte: all data created on the Internet each day
• SKA will collect 'a few exabytes' of data each day, processing 10 PB/hr and producing up to 100x CERN's output each year
• Using terabyte drives, a yotta centre would be as large as the states of Delaware and Rhode Island
• Using SDXC cards, it would be as large as the Great Pyramid of Giza
• NASA already stores 32 PB of climate data
Not just about processing
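These figures are easy to sanity-check. A minimal Python sketch of the arithmetic, assuming the decimal prefixes defined above (TB = 10^12 bytes) and the per-slide figures quoted for the jet engine and Facebook:

```python
# Back-of-the-envelope checks of the scale claims above,
# using decimal prefixes: 1 TB = 10**12 bytes, 1 YB = 10**24 bytes.
TB, PB, EB, YB = 10**12, 10**15, 10**18, 10**24

# A "yotta centre" built from 1 TB drives needs a trillion of them.
print(f"{YB // TB:,} one-terabyte drives per yottabyte")

# One Boeing engine at 20 TB/hr over a 10-hour flight.
print(f"{20 * TB * 10 / PB:.1f} PB per engine per 10-hour flight")

# Days for Facebook, at 500 TB/day, to accumulate one exabyte.
print(f"{EB / (500 * TB):,.0f} days for Facebook to reach 1 EB")
```

Running it prints a trillion drives per yottabyte, 0.2 PB per flight, and 2,000 days to an exabyte, which is why storage alone is not the hard part.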
• ...except in the hands of the first indiscriminate seller and unscrupulous buyer
• unless all elements are preserved, the fear is that it will be more difficult to find gold
• and legislation may force companies to reveal what they hold and what they share
Data without analysis is useless
Ref: California’s Right to Know bill AB 1291 demands accountability from huge data hoarding firms
• Big Data in medicine often revolves around gene sequencing and biosamples, vital records, and insurance claims. Data use and reuse has created grave concerns about privacy and informed consent.
• geotagging and social media fuel the debate over big data privacy. Does informed mean consent?
• personalized searches are automatically recorded and mined in every way possible. Crowd wisdom and individual preferences create a climate of unease among web search users
Overanalyzed data creates concerns
Source: http://journalistsresource.org/studies/economics/business/what-big-data-research-roundup#
• Internet Census 2012 was an unauthorized deep dive into the security of all Internet-connected devices
• It infected vulnerable devices with a custom binary and used their processing power to expand the scale of scans
• It found over 35 million vulnerable devices and millions of others that should not be online at all
• The research data was analyzed and anonymously published online after the 2-year project was completed
• 9+ TB collected and analyzed; 52 billion ICMP pings, 180 billion service probes, 71 billion ports tested
Unauthorized Internet scale analysis
Source: http://internetcensus2012.bitbucket.org image: 420,000 Carna Botnet locations
• anticipatory systems like Google Now already have a positive impact on individual productivity
• crowdfunding has a global economic impact and an even bigger innovative footprint
• crowdsourcing assists with investigations and research, especially as small data is tapped
Observing people through shared data
• 20 million geolocated tweets during 4-day event
• grocery shopping peaks night before hurricane
• night life up after it
• Manhattan skew
• Impact area shows lowest tweet rates
• signal problem may undermine big data value
Social media & major event correlation
Source: Rutgers Twitter/Foursquare Sandy Study 2012 http://popa.ca/SandyBigData
• global data doubles every 2 years, but only 0.5% is ever analyzed
• strength comes from pooling the data, but value is in individualizing findings
• how can personal analytics be custom-fitted to benefit individuals without first impacting privacy?
Big data's promise is not in aggregation
• Hundreds of millions of devices vulnerable globally
• 95% unpatched and vulnerable
• 5% of those patched are still vulnerable to zero-days
• the study shows that 75% are at least 6 months behind
• but it also shows that the focus isn't just on one aspect: it's a massive systemic issue that was allowed to grow into a global threat
• Government issued a public recommendation to discontinue the use of Java because it is unsafe
A sobering look at the Java threat
Source: 2013 Websense http://popa.ca/JavaSecurityPie
“If you can crunch it, more data means better results. The caveats are that you get proportionately less information by volume, and quality tends to decrease over time.”
The Opportunity of Big Data
• Since data storage will reach a practical asymptotic maximum, we can distribute resources
• This will help with data quality, input filtering, metrics and statistics, layered privacy filtering, reporting filters, data siloing and segregation, neural net-style learning to maximize efficiency, etc.
Redefining organized, as in crime
• every year, hundreds of millions of records are siphoned from diverse databases globally
• SIN/SSNs, credit card data, home addresses all amount to one thing: identities
• The vast majority of that data has to date gone unexploited, likely due to analytic challenges
IBRF: ID business requirements 1st!
1. initial focus on customer-centric outcomes
2. enterprise-wide big data blueprint
3. get near-term results from existing data
4. build analytics capabilities on business priorities
5. create a business case on measurable outcomes
2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
Who isn't already mining it?
The picture changes every minute
Responsible Visualization of Big Data
• Data discovery is ideally positioned to identify PII, create filters & create healthy correlations
• Data quality visuals are opportunities to form segments or hierarchies & create aggregate logic
• Storytelling often identifies outliers. Expertly narrate correlations and patterns, de-identify exceptions
Where to look for Privacy in Big Data
• Dashboards are used to present meaningful data and should be adequately tested for compliance
• Tools can be tailored to audiences and should be specialized to eliminate undesirable inferences
• Trends and predictions result from proper data analysis. This is where meaning becomes evident
“Hyped indiscriminately and handled inappropriately, big data analytics can be more of a liability than an opportunity to derive rich information through intelligent refinement.”
The Challenge of Big Data
• Does logging in with social media accounts constitute consent?
• Will aggregation and data masking still lead to personally identifiable information?
• How can data privacy filtering be guaranteed?
Big Data example: the Click dataset
• Objective: “To study the structure and dynamics of Web traffic networks”
• 53.5-billion-click anonymized dataset @IndianaU
• Data collected includes referrer, timestamp, URL
• Does sanitization == anonymization == privacy? (sketched below)
Source: CNetS: http://cnets.indiana.edu
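The sanitization question above is worth making concrete. A toy Python sketch, using invented records rather than anything from the actual Click dataset, shows how a log stripped of user IDs can still single people out through quasi-identifiers:

```python
# Toy illustration (invented records, not the actual Click dataset):
# even with user IDs stripped, combinations of remaining fields such as
# (referrer, timestamp) can be unique enough to act as fingerprints.
from collections import Counter

clicks = [
    ("news.example.com", "2013-04-01T09:15:02", "/article/42"),
    ("news.example.com", "2013-04-01T09:15:02", "/article/42"),
    ("blog.example.org", "2013-04-01T09:15:03", "/post/7"),
    ("shop.example.net", "2013-04-01T09:16:11", "/item/991"),
]

# Count how many records share each (referrer, timestamp) quasi-identifier.
counts = Counter((referrer, ts) for referrer, ts, _url in clicks)
unique = sum(1 for c in counts.values() if c == 1)
print(f"{unique} of {len(counts)} quasi-identifier tuples are unique")
# A tuple observed only once can potentially be linked to one person,
# so on these fields the data is not k-anonymous for any k > 1.
```

Sanitization removed the obvious identifiers, yet two of the three quasi-identifier tuples remain unique; that gap is exactly where anonymization and privacy part ways.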
Open, Distributed DIY Big Data Tools
• D3: Data-Driven Documents
• GitHub | SourceForge
• Hadoop/MapReduce (pattern sketched below)
• Amazon cloud
• Open source grids
• Mechanical Turk?
Source: Wikipedia Recent Changes Map and wikistream projects
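To ground the Hadoop/MapReduce entry, here is a minimal single-machine sketch of the programming model that Hadoop distributes across a cluster, using the canonical word count. It illustrates the map/shuffle/reduce pattern only, not Hadoop's actual API:

```python
# Minimal, single-process sketch of the MapReduce pattern that Hadoop
# distributes across a cluster: map records to key/value pairs, group
# by key (the "shuffle"), then reduce each group. Pure Python, no Hadoop.
from collections import defaultdict

def map_phase(record):
    # Emit one (word, 1) pair per word in the input record.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Combine all counts emitted for the same key.
    return key, sum(values)

records = ["big data is big", "data about data"]

# Shuffle: group mapped values by key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

results = dict(reduce_phase(k, v) for k, v in groups.items())
print(results)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because map and reduce are pure functions over independent keys, the same code shape scales from this toy loop to thousands of nodes, which is what makes the tools above "DIY" in the first place.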
Canadian Big Data: The Source
• Detailed metrics showed a preference for high-end products
• A move away from $150 items to $650 ones increased sales 40% in high-end electronics
• Notorious for overcollection, The Source actually does 'the consent bit' adequately well
Privacy enjoys safe sets
• intrinsic safety in: meteorology, environmental, physics, astronomy and other sciences
• innate risk in: connectomics, biological, Internet, behavioural and sensory data sets
“We simply cannot afford to entertain the notion that the proliferation of scattered data sources is the last bastion of privacy protection.”
• Build privacy into the input data sets
• Use simple filtering for large data sets & output
• Build algorithms to ensure de-identification is irreversible
• Try to break it! (a sketch follows below)
Technical solution of Big Data privacy
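Below is a minimal sketch of that build/filter/break loop, with hypothetical field names; the deck prescribes the goals, not this particular technique. It uses a keyed hash (HMAC) so pseudonyms cannot be linked back without the secret key; destroying the key after ingestion makes the mapping effectively irreversible:

```python
# Hypothetical illustration: pseudonymize PII fields on the way into a
# data set, so downstream analytics never see raw identifiers.
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # destroy after ingestion for irreversibility

def pseudonymize(value: str) -> str:
    """Keyed hash: stable within one run, unlinkable without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"sin": "123-456-789", "postal": "M5V 1J1", "purchase": "laptop"}
PII_FIELDS = {"sin", "postal"}
safe = {k: pseudonymize(v) if k in PII_FIELDS else v for k, v in record.items()}
print(safe)  # identifiers replaced, analytic fields intact

# "Try to break it": a plain, unkeyed hash of a low-entropy field fails
# this test, because an attacker can enumerate all ~10**9 possible SINs
# offline and match them. The unknown HMAC key defeats that dictionary
# attack, and destroying the key removes even the custodian's ability
# to reverse the mapping.
```

The design choice worth noting: plain hashing of identifiers is filtering, not irreversibility; only the keyed step, followed by key destruction, satisfies the third bullet above.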
Privacy must be tack[l]ed [head-]on
“Data with the potential of being personally identifiable should be treated with the same veracity as dirty input.”
7 Steps to Building your own Healthy Information Ecosystem
1. Articulate your vision
2. Put your stop orders in place
3. Assign roles and accountabilities
4. Create processes to manage it
5. Build controls and code standards from the bottom up
6. Prioritize data ownership, integrity and classification
7. Implement layered and automated audits
Big Data Leadership
• Embrace openness, build on what works
• Adopt standards for process and technology
• Draft progressive legislation (Model CASL, PCI and even CP laws)
• Encourage awareness, promote accountability
• Applaud and showcase responsible innovation
• Put forward important notions of information life cycle, data ownership, and privacy compliance
Big Data Links
• Free course: http://popa.ca/BigDataCourse (Coursera)
• More from http://bigdatauniversity.com/courses/
• This presentation: http://linkedIn.ClaudiuPopa.com
• Visualization gallery: http://datavis.ca
• InformationisBeautiful.net + Awards
• HowBigReally.com / HowManyReally.com
• The Signal and the Noise (Nate Silver @ amazon.ca)
• Sharing data mining policies
• Demonstrating fair use using audits
• Caring about the purveyor spectrum
• When it gets easy, it may be too late
Big Data open discussion
Follow Twitter.ClaudiuPopa.com
Read Subscribe.ClaudiuPopa.com
Connect LinkedIn.ClaudiuPopa.com

Editor's Notes

  • #2 Let's Get Real: The Real Impacts of Big Data. Claudiu Popa, CIPP/US, President, Informatica Corporation. In this session, we will illustrate, using clear examples, what big data means for privacy protection, compliance and, ultimately, individuals. We'll address the concerns over data aggregation from multiple legitimate sources, along with ways to design adequate controls to protect personal information by segregating it into siloed environments, logical and physical boundaries and policy enforcement. Real-world solutions are provided for what is rapidly becoming a standard way of distilling ever-more information from a universe of sources, including cloud-based services and a diverse array of mobile platforms. What you'll take away: security concepts of data aggregation; privacy concepts of interjurisdictional data transfer; technology concepts based on the use of diverse mobile devices.
  • #3 He's a remarkable achievement. In 2350 or thereabouts, this robot will be able to... except that we've already achieved all those milestones. His positronic net was capable of processing sixty trillion operations per second (teraflops were already exceeded by the end of the '90s; we are now at hundreds of petaflops, heading for 132 exaflops by 2016, i.e. 132 quintillion mathematical operations per second, a quintillion being a 1 followed by 18 zeros) and had a storage capacity of eight hundred quadrillion bits, which is approximately one hundred petabytes (88.81784197 PiB). He was constructed circa 2336 on the planet Omicron Theta. The pebibyte is a standards-based binary multiple (prefix pebi, symbol Pi) of the byte, a unit of digital information storage. The pebibyte unit symbol is PiB.[1] 1 pebibyte = 2^50 bytes = 1,125,899,906,842,624 bytes = 1024 tebibytes. The pebibyte is closely related to the petabyte, which is defined as 10^15 bytes = 1,000,000,000,000,000 bytes (see binary prefix).
  • #4 Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.
  • #5 Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."[5][6][7]
  • #6 The premise is that by accumulating an abundance of data, we have the option of cherry picking what we want, systematizing the approach and discarding the rest in an effort to predict an outcome with a high degree of accuracy. The data revolution will inevitably become personal. The sick thing is that the question is always going to be: do we have enough? How much is enough? Can we add more? Is big data a recipe for data addiction? Because if it is, it is a definite path to abuse. But before we embark on that discussion, it makes sense to ask, what is ‘small data’ right?At what point does the data get anonymized? What does it mean? Can you still uniquely identify people based on certain parameters?