The REAL Impact of Big Data on Privacy


The awesome promise of Big Data is tempered by the need to protect personal information. Data scientists must expertly navigate the legislative waters and acquire the skills to protect privacy and security. This talk provides enterprise leaders with answers and suggests questions to ask when the time comes to consider the vast opportunities offered by big data.

Published in: Technology

  • Let’s Get Real: The Real Impacts of Big Data. Claudiu Popa, CIPP/US, President, Informatica Corporation. In this session, we will illustrate, using clear examples, what big data means for privacy protection, compliance and, ultimately, individuals. We’ll address the concerns over data aggregation from multiple legitimate sources, along with ways to design adequate controls to protect personal information by segregating it into siloed environments, logical and physical boundaries and policy enforcement. Real-world solutions are provided for what is rapidly becoming a standard way of distilling ever-more information from a universe of sources, including cloud-based services and a diverse array of mobile platforms. What you’ll take away: security concepts of data aggregation; privacy concepts of interjurisdictional data transfer; technology concepts based on the use of diverse mobile devices.
  • He’s a remarkable achievement. In 2350 or thereabouts, this robot will be able to... except that we’ve already achieved all those milestones. His positronic net was capable of processing sixty trillion operations per second (teraflops were already exceeded by the end of the 90s, now up to hundreds of petaflops and soon to 132 exaflops by 2016, or 132 quintillion mathematical operations per second; a quintillion is 1 followed by 18 zeros) and had a storage capacity of eight hundred quadrillion bits, which is approximately one hundred petabytes (88.81784197 PiB). He was constructed circa 2336 on the planet Omicron Theta. The pebibyte is a standards-based binary multiple (prefix pebi, symbol Pi) of the byte, a unit of digital information storage. The pebibyte unit symbol is PiB.[1] 1 pebibyte = 2^50 bytes = 1125899906842624 bytes = 1024 tebibytes. The pebibyte is closely related to the petabyte, which is defined as 10^15 bytes = 1000000000000000 bytes (see binary prefix).
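The storage arithmetic in this note can be checked with a short sketch. The function names below are illustrative and not part of the talk; the only inputs are the figures quoted above (eight hundred quadrillion bits, 10^15 bytes per petabyte, 2^50 bytes per pebibyte).

```python
# Decimal (SI) and binary (IEC) byte multiples, as defined in the note above.
PETABYTE = 10**15   # 1 PB
PEBIBYTE = 2**50    # 1 PiB = 1,125,899,906,842,624 bytes


def bits_to_bytes(bits: int) -> int:
    """Convert a bit count to whole bytes (8 bits per byte)."""
    return bits // 8


def describe_capacity(bits: int) -> tuple[float, float]:
    """Return the capacity as (petabytes, pebibytes)."""
    n_bytes = bits_to_bytes(bits)
    return n_bytes / PETABYTE, n_bytes / PEBIBYTE


# Eight hundred quadrillion bits, the storage figure quoted above.
pb, pib = describe_capacity(800 * 10**15)
print(f"{pb:.0f} PB = {pib:.8f} PiB")  # 100 PB = 88.81784197 PiB
```

The gap between the two results (100 versus roughly 88.8) is purely the difference between the decimal and binary prefixes.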
  • Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.
  • Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."[5][6][7]
  • The premise is that by accumulating an abundance of data, we have the option of cherry picking what we want, systematizing the approach and discarding the rest in an effort to predict an outcome with a high degree of accuracy. The data revolution will inevitably become personal. The sick thing is that the question is always going to be: do we have enough? How much is enough? Can we add more? Is big data a recipe for data addiction? Because if it is, it is a definite path to abuse. But before we embark on that discussion, it makes sense to ask, what is ‘small data’, right? At what point does the data get anonymized? What does it mean? Can you still uniquely identify people based on certain parameters?
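One common way to make the last question concrete is to count how many records share the same combination of indirect attributes (often called quasi-identifiers) once names are removed. The sketch below is a minimal illustration of that idea, not a method described in the talk; the records, field names and helper functions are all made up for the example.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Smallest group size: each record blends in with at least k - 1 others."""
    return min(group_sizes(records, quasi_identifiers).values())


def unique_records(records, quasi_identifiers):
    """Records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[q] for q in quasi_identifiers)] == 1]


# Made-up records with names already stripped out.
records = [
    {"postal": "M5V", "birth_year": 1980, "sex": "F"},
    {"postal": "M5V", "birth_year": 1980, "sex": "F"},
    {"postal": "M5V", "birth_year": 1975, "sex": "M"},
    {"postal": "M5V", "birth_year": 1978, "sex": "M"},
]

full = ["postal", "birth_year", "sex"]
print(k_anonymity(records, full))                  # 1
print(len(unique_records(records, full)))          # 2
print(k_anonymity(records, ["postal", "sex"]))     # 2
```

With all three parameters, two of the four "anonymized" records are still unique. Dropping the birth year raises the smallest group to two, at the cost of less detailed data.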

    1. 1. LET’S GET REAL: THE REAL IMPACT OF BIG DATA. Claudiu Popa @datarisk
    2. 2. THE DEFINITION • Short for big data analytics • The trend towards larger data sets allowing correlations to be found to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions
    3. 3. THE BUZZWORD • Marketing term: increasingly used to evoke computational power and large-scale algorithms that can turn mountains of data into usable information and intelligent business decisions • Modern-day version of Big Brother: diverse transactional data that can impact privacy
    5. 5. The Promise of Big Data: Sexiest Job of the 21st Century? • Predicting crime/fraud detection • Timing investments • Mining astro data for E.T. • Productivity loss of stressful travel • Call centre analytics • Dynamic ticket pricing • Guessing demand for better service • E-com & customer service • Capture recurring revenue • Using sensor data for efficiency • Barack Obama’s campaign • Medical research/treatment • Reduced pushback campaigns • Crowdfund / Crowdsource • Open pit data mining • Mapping the sex trade • Disease eradication • Tracking endangered species
    6. 6. Big benefits Source: Fast Company. Photos: Jeff Brown. Illustrations: Justin Mezzell
    7. 7. can’t we agree on a definition? 2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
    8. 8. More than Mining: The Reality of Big Data
    9. 9. More fun than mining for BitCoins Full graph back to the 1500s saved at
    10. 10. Some get it...
    11. 11. others, not so much.
    12. 12. Seeking clarity? “Given the signal-to-noise ratio, big data itself appears to be telling us that working raw numbers at ever greater scale to force out some answers may not be the real achievement here. Instead, it is perhaps in the elegance with which it steers us towards the right questions to ask.”
    13. 13. big data is often personal 2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
    14. 14. but socmed is a big part of the risk
    15. 15. Do custodians have a responsibility? • Google Person Finder • Flu Trends • Dengue Trends • Crisis Maps • MOOC metrics • Sex trade tracking
    16. 16. big data growing big time, and fast Source #1: author-series/2012/08/13/if-you-build-it-they-will-come
    17. 17. the 3 dimensions of Big Data • volume: is there enough data? Too much? • variety: how diverse is the data? • velocity: how rapidly must it be processed?
    18. 18. the 6 new aspects of Big Data • validity: what is the threshold for applicability? • verifiability: will we continue to trust the input source? • variability: what’s the tolerance for variation? • veracity: is it useful for predicting business value? • value: does it offer valuable insight? • visualization: can it be meaningfully represented?
    19. 19. Big Value • silos: are departmental silos good or bad? • timely exploitation: volume and velocity = complexity • system integration: the real enablers need to ‘get’ privacy
    20. 20. “Visualization is the most likely V to save big data by making personalized information available to owners and not custodians in natural, frictionless ways.”
    21. 21. Full disclosure: why I care • integrity: if you can’t trust the data, immense effort and expense are lost • confidentiality: the impact of potential breaches is vastly increased • availability: input availability impacts value and velocity in particular • privacy: impacted by any breach of data sets/clusters with PII. Hint: it’s not because Gartner said the space is worth up to $3.8 trillion
    22. 22. The First 4 Vs: Where’s the Privacy? Source: Data Science Central
    23. 23. too valuable to discard? Source:
    24. 24. Visualizing Privacy in Big Data
    25. 25. Are we there yet? • Utah Data Center will handle yottabytes of data • TB: 10^12 bytes, Exa: 10^18, Zetta: 10^21, Yotta: 10^24 • Each Boeing jet engine creates 20 TB/hr • Facebook grows by 500 TB/day. Ref: infographic on big data
    26. 26. Scale • PB = 10^15: all printed material in existence in 1995 or ONE SECOND of data generated at CERN/LHC • 1 exa: all data created on the Internet each day • SKA will collect ‘a few exabytes’ of data each day, processing 10 PB/hr and producing up to 100x CERN’s output each year
    27. 27. Not just about processing • Using terabyte drives, a yotta centre would be as large as the states of Delaware and Rhode Island • Using SDXC cards, it would be as large as the Great Pyramid of Giza • NASA already stores 32 PB of climate data
    28. 28. Data without analysis is useless • ...except in the hands of the first indiscriminate seller and unscrupulous buyer • unless all elements are preserved, the fear is that it will be more difficult to find gold • and legislation may force companies to reveal what they hold and what they share. Ref: California’s Right to Know bill AB 1291 demands accountability from huge data hoarding firms
    29. 29. Overanalyzed data creates concerns • Big Data in medicine often revolves around gene sequencing and biosamples, vital records, insurance claims. Data use and reuse has created grave concerns about privacy and informed consent. • geotagging and social media fuel the debate over big data privacy. Does informed mean consent? • personalized searches are automatically recorded and mined in every way possible. Crowd wisdom and individual preferences create a climate of unease among web search users. Source:
    30. 30. Unauthorized Internet scale analysis • Internet Census 2012 was an unauthorized deep dive into the security of all Internet-connected devices • It infected vulnerable devices with a custom binary and used their processing power to expand the scale of scans • It found over 35 million vulnerable devices and millions of others that should not be online at all • The research data was analyzed and anonymously published online after the 2-year project was completed • 9+ TB collected and analyzed, 52 bn ICMP pings, 180 bn service probes, 71 bn ports tested. Source: image: 420,000 Carna Botnet locations
    31. 31. Observing people through shared data • anticipatory systems like Google Now already have a positive impact on individual productivity • crowdfunding has a global economic impact and an even bigger innovative footprint • crowdsourcing assists with investigations and research, especially as small data is tapped
    32. 32. Social media & major event correlation • 20 million geolocated tweets during 4-day event • grocery shopping peaks night before hurricane • night life up after it • Manhattan skew • Impact area shows lowest tweet rates • signal problem may undermine big data value. Source: Rutgers Twitter/Foursquare Sandy Study 2012
    33. 33. Big data’s promise is not in aggregation • global data doubles every 2 years, but only 0.5% is ever analyzed • strength comes from pooling the data, but value is in individualizing findings • how can personal analytics be custom-fitted to benefit individuals without first impacting privacy?
    34. 34. A sobering look at the Java threat • Hundreds of millions of devices vulnerable globally • 95% unpatched and vulnerable • 5% of those patched are still vulnerable to zero-days • study shows that 75% are at least 6 months behind • but it also shows that the focus isn’t just on one aspect. It’s a massive systemic issue that was allowed to grow into a global threat. • Government issued a public recommendation to discontinue the use of Java because it is unsafe. Source: 2013 Websense
    35. 35. “If you can crunch it, more data means better results. The caveats are that you get proportionately less information by volume and quality tends to decrease over time.”
    36. 36. The Opportunity of Big Data • Since data storage will reach a practical asymptotic maximum, we can distribute resources • This will help with data quality, input filtering, metrics and statistics, layered privacy filtering, reporting filters, data siloing and segregation, neural net-style learning to maximize efficiency, etc.
    37. 37. Redefining organized, as in crime • every year, hundreds of millions of records are siphoned from diverse databases globally • SIN/SSNs, credit card data, home addresses all amount to one thing: identities • The vast majority of that data has to date gone unexploited, likely due to analytic challenges
    38. 38. IBRF: ID business requirements 1st! 1. initial focus on customer-centric outcomes 2. enterprise-wide big data blueprint 3. get near-term results from existing data 4. build analytics capabilities on business priorities 5. create a business case on measurable outcomes 2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
    39. 39. Who isn’t already mining it?
    40. 40. The picture changes every minute
    41. 41. Responsible Visualization of Big Data • Data discovery is ideally positioned to identify PII, create filters & create healthy correlations • Data quality visuals are opportunities to form segments or hierarchies & create aggregate logic • Storytelling often identifies outliers. Expertly narrate correlations and patterns, de-identify exceptions
    42. 42. Where to look for Privacy in Big Data • Dashboards are used to present meaningful data and should be adequately tested for compliance • Tools can be tailored to audiences and should be specialized to eliminate undesirable inferences • Trends and predictions result from proper data analysis. This is where meaning becomes evident
    43. 43. “Hyped indiscriminately and handled inappropriately, big data analytics can be more of a liability than an opportunity to derive rich information through intelligent refinement.”
    44. 44. The Challenge of Big Data • Does logging in with social media accounts constitute consent? • Will aggregation and data masking still lead to personally identifiable information? • How can data privacy filtering be guaranteed?
    45. 45. Big Data example: the Click dataset • Objective: “To study the structure and dynamics of Web traffic networks” • 53.5 billion click anonymized dataset @IndianaU • Data collected includes referrer, timestamp, URL • Does sanitization == anonymization == privacy? Source: CNetS:
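To illustrate what "sanitization" of a click record might look like, here is a hedged sketch that reduces the three collected fields (referrer, timestamp, URL) to host names and an hour-level timestamp. This is an assumption made for illustration only: it is not the procedure used for the Indiana University dataset, and the record and the `sanitize_click` helper are invented.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit


def sanitize_click(click: dict) -> dict:
    """Reduce a click record to host names and an hour-level UTC timestamp."""
    ts = datetime.fromtimestamp(click["timestamp"], tz=timezone.utc)
    return {
        "referrer_host": urlsplit(click["referrer"]).hostname,
        "url_host": urlsplit(click["url"]).hostname,
        "hour": ts.replace(minute=0, second=0, microsecond=0).isoformat(),
    }


# Invented record: the raw query string and path both expose a person.
click = {
    "referrer": "http://search.example.com/?q=jane+doe+oncologist+toronto",
    "timestamp": 1356998461,
    "url": "http://clinic.example.org/patients/jane-doe/appointments",
}

clean = sanitize_click(click)
print(clean["referrer_host"])  # search.example.com
print(clean["url_host"])       # clinic.example.org
print(clean["hour"])           # 2013-01-01T00:00:00+00:00
```

Stripping paths and query strings removes the most obvious identifiers, but it does not settle the slide's question: a long enough sequence of hosts and hours can still be distinctive for one user.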
    46. 46. Open, Distributed DIY Big Data Tools • D3: Data Driven Documents • GitHub | SourceForge • Hadoop/MapReduce • Amazon cloud • Open source grids • Mechanical Turk? Source: Wikipedia Recent Changes Map and wikistream projects
    47. 47. Canadian Big Data: The Source • Detailed metrics showed preference for high end products • Move away from $150 items to $650 ones increased sales 40% in high end electronics • Notorious for overcollection, the Source actually does ‘the consent bit’ adequately well
    48. 48. Privacy enjoys safe sets • intrinsic safety in: meteorology, environmental, physics, astronomy and other sciences • innate risk in connectomics, biological, Internet, behavioural and sensory data sets
    49. 49. “We simply cannot afford to entertain the notion that the proliferation of scattered data sources is the last bastion of privacy protection.”
    50. 50. Technical solution of Big Data privacy • Build privacy into the input data sets • Use simple filtering for large data sets & output • Build algorithms to ensure irreversibility of privacy • Try to break it!
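One possible reading of "irreversibility" and "try to break it" is sketched below: replace identifiers with pseudonyms, then attack your own pseudonyms by enumeration. The technique shown (plain SHA-256 versus keyed HMAC-SHA256) is swapped in for illustration and is not attributed to the talk; the four-digit identifier space, the key and the helper names are hypothetical.

```python
import hashlib
import hmac
import secrets


def plain_hash(identifier: str) -> str:
    """Unkeyed SHA-256: deterministic, so anyone can recompute it."""
    return hashlib.sha256(identifier.encode()).hexdigest()


def keyed_pseudonym(identifier: str, key: bytes) -> str:
    """HMAC-SHA256: recomputing the pseudonym requires the secret key."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()


def brute_force(token: str, candidates, hash_fn):
    """Try every candidate identifier; return the one matching the token, if any."""
    for candidate in candidates:
        if hash_fn(candidate) == token:
            return candidate
    return None


def attacker_guess(identifier: str) -> str:
    """An attacker without the key can only guess one (here, an all-zero key)."""
    return keyed_pseudonym(identifier, b"\x00" * 32)


# Toy identifier space: four-digit IDs, small enough to enumerate instantly.
candidates = [f"{n:04d}" for n in range(10_000)]
secret_key = secrets.token_bytes(32)  # hypothetical key held by the custodian

leaked_plain = plain_hash("4821")
leaked_keyed = keyed_pseudonym("4821", secret_key)

recovered_plain = brute_force(leaked_plain, candidates, plain_hash)
recovered_keyed = brute_force(leaked_keyed, candidates, attacker_guess)
print(recovered_plain)  # 4821
print(recovered_keyed)  # None
```

The unkeyed hash falls to simple enumeration because the space of possible identifiers is small, which is also true of real identifiers such as nine-digit SINs. The keyed version holds only as long as the key stays secret, so key management becomes part of the privacy design.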
    51. 51. Privacy must be tack[l]ed [head-]on
    52. 52. “Data with the potential of being personally identifiable should be treated with the same veracity as dirty input.”
    53. 53. 7 Steps to Building your own Healthy Information Ecosystem 1. Articulate your vision 2. Put your stop orders in place 3. Assign roles and accountabilities 4. Create processes to manage it 5. Build controls and code standards from the bottom up 6. Prioritize data ownership, integrity and classification 7. Implement layered and automated audits
    54. 54. Big Data Leadership • Embrace openness, build on what works • Adopt standards for process and technology • Draft progressive legislation (Model CASL, PCI and even CP laws) • Encourage awareness, promote accountability • Applaud and showcase responsible innovation • Put forward important notions of information life cycle, data ownership, and privacy compliance
    55. 55. Big Data Links • Free course (Coursera) • More from • This presentation: • Visualization gallery: • + Awards • / • The signal and the noise (Nate Silver @
    56. 56. Big Data open discussion • Sharing data mining policies • Demonstrating fair use using audits • Caring about the purveyor spectrum • When it gets easy, it may be too late
    57. 57. Follow Read Connect