Big Data and Computational Ethics (Vladislav Shershulsky)



  1. Big Data: applications, ethics, algorithms. Vladislav Shershulsky, Microsoft Rus. ALMADA, Moscow, 2013-07-02
  2. The form
  • This is a one-hour presentation
  • Thus, I have almost no chance of teaching you something valuable, or of discussing any really new and sophisticated results
  • This is simply an overview of several interesting techniques related to one important, and often underestimated, topic
  3. The content
  • The topic is “Computational ethics for Big Data apps”
  • It is not about why people should behave ethically
  • It is about the following:
    • how to force people to behave ethically while dealing with our apps
    • how to ensure that our Big Data apps are “moral by design”
    • and what it technically means “to be digitally ethical”
  4. The map: Processing and Data; “Upstream” and “Downstream”; Social Interactions; Consolidation and Aggregation; De- & Re-personalization; Data Leaking and Forensics; Category Theory; Traditional Ethics for Big Data; (New) Information Ethics; Categories on (Statistics ⊗ Structure); Universal Moral Grammar; Deontic Logic and Obligations; Algebra of Conscience; Differential Privacy; Predictive Analytics
  5. I. Motivation: why should we care
  6. Broadly acknowledged definition: “Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” (Gartner, 2012)
  7. Big Data is essentially complex. We are usually not aware of the generating laws forming big data sets, nor of the nature of the relations between them. But their huge volume allows us to infer laws (regressions…) and thus gives Big Data some predictive capabilities (within the limits of inferential reasoning). (P. Delort, 2011) A bit like natural objects/subjects of the material world.
  8. Big opportunity. For governments:
  • Budget savings
  • Transparency and responsibility
  • Real insight into society
  • Optimal decisions
  9. Big opportunity. For people:
  • Self-organization
  • Better experiences
  • Intelligent environment
  • Introspection
  10. Big opportunity. For business:
  • Converting products to services
  • Expanded value chains
  • New business models
  • Educated targeting
  [Figure: “From Product to Service”. Value for customer: V = V₀ + A·N + B·N², where V₀ is the immanent value, A·N the volume value, and B·N² the network value. On-premise and off-premise; Big Data & BI; clients, employees, partners; mobility & connectivity; socialization of business. Source: http://www.businesslogicsystems.com/Data%20Management]
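  To make the formula concrete, here is a tiny back-of-the-envelope sketch (Python; all coefficients are made up for illustration and do not come from the slide) showing how the quadratic network term comes to dominate customer value as the user base N grows:

```python
# Toy illustration of V = V0 + A*N + B*N**2 (hypothetical coefficients).
V0, A, B = 100.0, 2.0, 0.001  # immanent, volume, network coefficients

for N in (10, 1_000, 100_000):
    immanent, volume, network = V0, A * N, B * N ** 2
    print(f"N={N:>7}: immanent={immanent:>9.0f}  volume={volume:>11.0f}  "
          f"network={network:>13.0f}")
# Beyond N ~ A/B = 2,000 users, the network term dominates the total value.
```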
  11. Big opportunity. For the IT industry:
  • Next chance to change the world
  • Step towards the internet of everything
  • Completely new markets
  12. Big challenge. For people:
  • New lack of privacy
  • Automated justice
  • Need to understand AKMs in your backyard
  13. Big challenge. For business:
  • Hard to comply
  • Easy to violate
  • Unexpected backfire
  • Need to defend sources
  Examples: “Target Predicts Pregnancy with Big Data” (http://smallbusiness.yahoo.com/advisor/target-predicts-pregnancy-big-data-104057627.html); “Why Netflix's Facebook app would be illegal” by Julianne Pepitone, @CNNMoneyTech, March 27 (the VPPA arose from strange circumstances surrounding the failed Supreme Court nomination of Robert Bork: while Bork's nomination hearings were taking place in 1987, a freelance writer for the Washington City Paper talked a video store clerk into giving him Bork's rental history); “Google facing legal threat from six European countries over privacy” (http://www.guardian.co.uk/technology/2013/apr/02/google-privacy-policy-legal-threat-europe)
  14. Big challenge. For government:
  • It is hard to be transparent
  • It is easy to overuse
  • Hard to defend sources
  References: George Orwell, 1984; http://budget4me.ru/ob/faces/home; http://online.wsj.com/article/SB10001424052970203391104577124540544822220.html?mod=googlenews_wsj; http://www.wikileaks.org/; http://www.washingtonpost.com/investigations/us-intelligence-mining-data-from-nine-us-internet-companies-in-broad-secret-program/2013/06/06/3a0c0da8-cebf-11e2-8845-d970ccb04497_story.html
  15. Big challenge. For the IT industry:
  • Needs new hw and sw architectures to address scale
  • Needs to know how to protect
  • Needs to address extremely complicated usage scenarios
  • Risk of over-restrictive regulation
  16. Pro and contra
  Pro:
  • People: collective knowledge
  • Business: from disordered offerings to quality-of-life services
  • Government: know and address the real needs of citizens
  • IT industry: change the world (again?)
  Contra:
  • People: final lack of privacy
  • Business: disruptive scenarios
  • Government: a chance to lose everything in a minute
  • IT industry: new approaches to hw and sw architecture needed, addressing new challenges
  17. II. Computational ethics and Big Data
  18. Why ethics?
  • Benefiting from the opportunities and mitigating the risks assumes careful handling of digital assets of high business and personal value, both in known scenarios and in completely new situations
  • To proceed successfully, one should follow some sort of fundamental principles, clear and consistent
  “Ethics, also known as moral philosophy, is a branch of philosophy that involves systematizing, defending and recommending concepts of right and wrong conduct.” (http://www.iep.utm.edu/ethics/)
  19. Big Data and traditional ethics
  • Let's take concepts from traditional ethics and examine how they should apply to the digital world, and how they evolve under the influence of Big Data capabilities
  • Four elements of Big-Data ethics: Identity, Privacy, Ownership, Reputation
  • Big Data is ethically neutral
  • Personal data: not some specific data, but any data generated in the course of a person's activities
  • Privacy interests, not always ultimate rights
  • A responsible organization is an organization that is concerned both with handling data in a way that aligns with its values and with being perceived by others to handle data in such a manner.
  (Kord Davis. Ethics of Big Data: Balancing Risk and Innovation. O'Reilly Media, 2012)
  20. Big-Data ethics: Identity
  • Identity (in philosophy), also called sameness, is whatever makes an entity definable and recognizable (Wikipedia)
  • Christopher Poole vs. Mark Zuckerberg: “prismatic” multi-identity vs. mono-identity
  • Some governments are concerned about identity on the Internet: Italy and Belarus, say, require an ID to obtain access
  • Does Big Data allow us to reconstruct identity?
  (Partially following Kord Davis, Ethics of Big Data)
  21. Big-Data ethics: Privacy
  • Privacy is the ability of an individual or group to seclude themselves, or information about themselves, and thereby reveal themselves selectively (Wikipedia).
  • In 1993, the New Yorker published a cartoon whose caption read: “On the Internet, nobody knows you're a dog.” At the time, this was funny because it was true. Today, in the age of big data, it is not only possible to know that you're a dog, but also what breed you are, your favorite snacks, your lineage, and whether you've ever won any awards at a dog show.
  • There are two issues. First, does privacy mean the same thing online and offline in the real world? Second, should individuals have a legitimate ability to control data about themselves, and to what degree?
  (Following Kord Davis, Ethics of Big Data)
  22. Big-Data ethics: Ownership
  • The degree of ownership we hold over specific information about us varies as widely as the distinction between privacy rights and privacy interests.
  • Do we, in the offline world, “own” the facts about our height and weight? Does our existence itself constitute a creative act, over which we have copyright or other rights associated with creation?
  • How do those offline rights and privileges, sanctified by everything from the Constitution to local, state, and Federal statutes, apply to the online presence of that same information?
  • At the end of the day, we increasingly pay for “free” online services by sharing our data with their providers
  (Following Kord Davis, Ethics of Big Data)
  23. Big-Data ethics: Reputation
  • As recently as 20 years ago, reputation consisted primarily of what people, specifically those who knew and frequently interacted with you, knew and thought about you. In some cases, a second-degree perception, that is, what the people who knew you said about you to the people they knew, might influence one's reputation.
  • One of the biggest changes born from big data is that the number of people who can now form an opinion about what kind of person you are is exponentially larger than it was a few years ago.
  • Further, your ability to manage or maintain your online reputation is drifting farther and farther out of individual control.
  • There are now companies whose entire business model is centered on “reputation management”
  (Following Kord Davis, Ethics of Big Data)
  24. Benefits of ethics inquiry
  • Faster consumer adoption by reducing fear of the unknown (how are you using my data?)
  • Reduction of friction from legislation, from a more thorough understanding of constraints and requirements
  • Increased pace of innovation and collaboration, derived from a sense of purpose generated by explicitly shared values
  • Reduction in the risk of unintended consequences, from an overt consideration of the long-term, far-reaching implications of the use of big-data technologies
  (Partially following Kord Davis, Ethics of Big Data)
  25. But now we want to go even further. We need more formal theory(ies) and more practical instruments to incorporate ethical behavior into our products and services. There are several important reasons to do so:
  • We intend to ensure and enforce ethical behavior of our apps
  • We hope that formal theory and algorithms will help us find reasonable (at least not very self-contradictory) solutions in cases where our everyday life experience does not work
  • We need to comply with complicated regulation and, at the same time, to check its consistency
  • We have to be ready for disruptive changes in society's adopted offline ethics under the influence of IT, and of Big Data in particular
  26. What to expect from computational ethics
  • We need a set of formal rules to classify steps (state changes) as ethically “right” or “wrong” (actually there are a few more options), and to find the most acceptable steps in cases where our intuition provides no solution
  • We need to apply these rules in a consistent way to people (subjects), to information objects (which in Big Data tasks have a certain level of autonomy due to their complexity), and to collectives of subjects and objects (collectives as moral agents as well)
  27. Computational ethics vs. roboethics. Machine ethics (or machine morality) is the part of the ethics of artificial intelligence concerned with the moral behavior of Artificial Moral Agents (AMAs), e.g. robots and other artificially intelligent beings. Machine ethics is sometimes referred to as computational ethics or computational morality. In contrast, roboethics is concerned with the moral behavior of humans as they design, construct, use and treat such beings. Here we do not talk about any sort of robotics or any similar professional code of conduct (i.e. it is not about hacker or software-developer ethics). This area of research has recently drawn a lot of attention due to Wendell Wallach of Yale University urging the U.S. and other governments to address the proliferation of drones and similar AKMs.
  28. A few useful approaches (we shall just mention them here and discuss them in more detail later). Most fundamental concepts:
  • Information logic, by Luciano Floridi. For many reasons this is a good background for the rest of the approaches.
  • Information accountability. An alternative to the “all prohibited by default” approach in access management.
  • Deontic logic. The field of logic concerned with obligation, permission, and related concepts, and, subsequently, a formal system that attempts to capture the essential logical features of these concepts. A good (but not the most general) approach to expressing access/usage rules. More general modal logics are of interest as well.
  29. A few useful approaches (we shall just mention them here and discuss them in more detail later). Practical concepts and calculi:
  • Moral grammar. Still not all-embracing, but a very expressive way to describe and solve situations of moral choice.
  • (Auto)reflexive multi-agent calculus of conflict and cooperation. The first model to describe the role of reflection in ethics, moral choice in conflicts, and the difference between intention and readiness. Introduced a simple classification of moral agents. Popular in Russia due to its expressiveness and predictive power, but rarely known abroad.
  • Differential privacy. A means to maximize the accuracy of queries on statistical databases while minimizing the chances of identifying their records.
  • Models, calculi, and languages to describe and derive ownership, access restrictions, and rules, including obligation-specific ones: Privacy-aware Role Based Access Control (P-RBAC), Web Ontology Language (OWL, OWL2), and eXtensible Access Control Markup Language (XACML).
  30. A few “ethically neutral” but useful instruments
  Category theory. Category theory is a toolset for describing the general abstract structures in mathematics. As opposed to set theory, category theory focuses not on elements x, y, ⋯ (called objects) but on the relations between these objects: the (homo)morphisms between them, f: x → y.
  Computation on encrypted data. The ability to perform some operations (such as searches or comparisons) on data encrypted by another entity. An area of active research with some prominent results. Often referred to as homomorphic encryption.
  31. A few references
  Information logic:
  • Luciano Floridi. Information ethics, its nature and scope. SIGCAS Computers and Society, Vol. 36, No. 3, September 2006, pp. 21-36.
  Information accountability:
  • Daniel Weitzner, Harold Abelson, Tim Berners-Lee, Joan Feigenbaum, James Hendler, Gerald Sussman. Information Accountability. Communications of the ACM, June 2008, pp. 82-87.
  Deontic logic:
  • Paul McNamara. Deontic Logic. Stanford Encyclopedia of Philosophy, 2006, 2010.
  Moral grammar:
  • John Mikhail. Moral Grammar and Intuitive Jurisprudence: A Formal Model of Unconscious Moral and Legal Knowledge. In The Psychology of Learning and Motivation: Moral Cognition and Decision Making. D. Medin, L. Skitka, C. W. Bauman, D. Bartels, eds., Vol. 50, Academic Press, 2009, pp. 27-100.
  (Auto)reflexive multi-agent calculus of conflict and cooperation:
  • Vladimir Lefebvre. Algebra of Conscience. Kluwer Academic Publishers, Dordrecht, Boston, 2001, 358 p.
  • Vladimir Lefebvre, Thomas Reader. Reflexive IW Model II. Report No. AD-A399417; ARL-SR-114. Feb. 2002. Army Research Lab., Survivability/Lethality Analysis Directorate, White Sands Missile Range, NM, USA, 50 p.
  Differential privacy:
  • Cynthia Dwork. Differential Privacy: A Survey of Results. Microsoft Research, April 2008.
  P-RBAC, OWL, XACML:
  • Qun Ni, Dan Lin, Elisa Bertino, and Jorge Lobo. Conditional Privacy-Aware Role Based Access Control. ESORICS 2007, LNCS 4734, J. Biskup and J. Lopez (eds.), Springer-Verlag, Berlin, Heidelberg, 2007, pp. 72-89.
  • For OWL see http://en.wikipedia.org/wiki/Web_Ontology_Language
  • For XACML see XACML 3.0, Committee Specification 01. OASIS (oasis-open.org). Retrieved 10 August 2010.
  Category theory:
  • http://ncatlab.org/nlab/show/category+theory#idea_10
  • Saunders Mac Lane. Categories for the Working Mathematician. 2nd edition, Springer Verlag, 1998, 314 p.
  Computation on encrypted data:
  • Abdullatif Shikfa. Computation on Encrypted Data: Private Matching, Searchable Encryption and More. Bell Labs, Alcatel-Lucent, 2013.
  32. Why Information Ethics. I will cover the other instruments later, while describing specific Big Data scenarios or operations, but Information Ethics, due to its universality, needs discussion in advance. Information Ethics (IE) is the theoretical foundation of applied computer ethics. IE is an expansion of environmental ethics towards:
  1) a less anthropocentric concept of agent, which now also includes non-human (artificial) and non-individual (distributed) entities;
  2) a less biologically biased concept of patient as a “centre of ethical worth”, which now includes not only human life or simply life, but any form of existence;
  3) a conception of environment that includes both natural and artificial (synthetic, man-made) eco-systems.
  IE is therefore: non-standard, patient-oriented rather than agent-oriented, environmental, non-anthropocentric but ontocentric, and based on the concepts of informational object/infosphere/entropy rather than life/ecosystem/pain. (Luciano Floridi, lecture at Cambridge, 2006)
  33. [Slide reproduced from Luciano Floridi, lecture at Cambridge, 2006]
  34. Brief history of ethics [diagram]: from Athenian citizens, to all-human ethics (agent-oriented), to bioethics (patient-oriented), to ecology (patient-oriented), to the infosphere (patient-oriented)
  35. Moral patient
  • Question: what is the lowest possible common set of attributes which characterizes something as intrinsically valuable and an object of respect, and without which something would rightly be considered intrinsically worthless or even positively unworthy and therefore rightly disrespectable in itself?
  • Answer: the minimal condition of possibility of an object's least intrinsic worthiness is its abstract nature as an informational entity.
  • Conclusion: all entities, interpreted as clusters of information, have a minimal moral worth qua informational objects that deserves to be respected.
  (Luciano Floridi, lecture at Cambridge, 2006)
  36. Four principles of Information Ethics. IE determines what is morally right or wrong, what ought to be done, what the duties, the “oughts” and the “ought nots” of a moral agent are, by means of four basic principles:
  0. entropy ought not to be caused in the infosphere (null law)
  1. entropy ought to be prevented in the infosphere
  2. entropy ought to be removed from the infosphere
  3. the welfare of the infosphere ought to be promoted by extending it, improving it and enriching it.
  (Luciano Floridi, lecture at Cambridge, 2006)
  37. Conclusions for Big Data
  • When we recognise that not only humans (actors) but also (at least some) information objects (patients) have rights that should be respected, and that under certain conditions increasing entropy in the infosphere (e.g. destroying information objects) is immoral, we shall have to agree that our privacy rights are only privacy interests, and should be balanced against the rights of the information objects we created.
  • Big Data (subsets, elements of BD) are complicated enough to satisfy, in many cases, the criteria of moral patient(s).
  • Our ultimate privacy requirement thus becomes questionable.
  • Do we really always have a “right to forget” (i.e. to somehow kill information objects)?
  • At some point in the future, information objects representing our identity in the infosphere may become autonomous enough (the shadow free-will dilemma).
  38. Conclusions for Big Data: information objects' right to respect
  • If you consider this problem abstract, just think about recent attempts in several countries to prohibit car dash cameras (video registrators) as potentially violating the privacy of passengers in other cars.
  • Or take into account that countries with over-restrictive regulation of Big Data collection and usage are at risk of economic and social slowdown.
  • Can we explain all this to legislators? Many of them are still at the stage of adopting ecology.
  Is this practical? Yes. It helps to architect a big data management system in a consistent way:
  • Your predictive conclusions have little value if you kill information objects voluntarily
  • Access mechanisms that respect information object rights help to avoid massive leaks and full service blocking (“all is prohibited by default”)
  39. III. Sources of Big Data
  40. Humans as data sources. Per person per day (in the “golden billion”):
  • 50-200 e-mails
  • 10-50 voice calls
  • 1-100 SMS and tweets
  • 0.1 blog posts
  • 1-20 financial transactions
  • 3-30 search requests
  • 10-30 articles read on the Internet
  • 10 audio records
  • 30-90 minutes of TV/video
  • 20-200 appearances on video monitoring cameras
  • 1-100 geospatial “notches”
  • 20-200 RFID checks
  • 0.05-10 healthcare records
  And at least 4.5 billion people have at least phones (mostly wireless).
  (From The Human Face of Big Data by Rick Smolan. EMC inspired.)
  41. Things as data sources: the Internet of Things
  42. World today
  43. …and tomorrow
  44. Subjects and objects (both actors and patients)
  Humans, unique by default:
  • Proactively using ICT infrastructure (social networks, online services, etc.)
  • Passively “catching the sight” of monitoring and recording systems
  Real-world identifiable objects:
  • Active networked devices
  • Passive objects, identifiable by context
  In between:
  • Pets with RFIDs
  • Animals in wild nature
  • AI agents (if, and when, they originate)
  Abstract objects:
  • Pieces of information (code and data, i.e. objects) in ICT systems, with their natural behavior and lifecycle
  45. IV. Procedures and Scenarios
  46. Processing
  • Operations conducted at a centralized location, but in a highly distributed and parallelized manner, to prepare Big Data for consumption
  • The most popular paradigm of Big Data processing is Map/Reduce (see the word-count sketch below)
  • Many experts consider Map/Reduce, or the math behind this paradigm, to be the heart of Big Data
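  As a minimal illustration of the Map/Reduce paradigm, here is the canonical word-count example as a toy single-process Python sketch (a real framework would distribute the map and reduce phases across many nodes):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit <key, value> pairs: one ("word", 1) per occurrence.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between the phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Summation is associative and commutative, so the framework may
    # combine partial results from different nodes in any order.
    return key, sum(values)

documents = ["big data big ethics", "ethics of big data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 2, 'ethics': 2, 'of': 1}
```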
  47. Pieces of data (information objects)
  • Can relate/belong to one subject/object, or can describe some sort of relation between subjects and/or objects. Examples: relating to an individual entity: birthday, location; relation: friend, call record, etc.
  • Generally have the structure <Key, Value>. Can contain one simple value, well-structured data, or a complicated unstructured or semi-structured (nested) set (tuple) of data (and associated code). Examples: name, location, friend list, blog, etc.
  48. “Upstream”
  • Users supply some personally identifiable info to some centralized location, to further consume some sort of social, business or governmental online service. Examples: Facebook, LinkedIn, gosuslugi.ru, etc.
  • Devices supply some spatially/personally associated info to some sort of centralized location, for further analysis or transactional services. Examples: mobile billing, smart power grids, weather forecast stations, a wind-tunnel data-stream acquisition system
  49. “Downstream”
  • A user extracts some personally identifiable information, related to him (her) or to other persons, to consume some sort of social, business or governmental online service. Examples: Facebook, LinkedIn, gosuslugi.ru
  • A user or device (the same as above, or a different one) extracts some information supplied by networked devices and/or users to consume or provide a service. Examples: Internet shallow search (in most cases), mobile billing, smart power grids, traffic advising systems
  50. Social interaction
  • In many cases, while consuming a social or transactional online service, users or devices directly or indirectly interact with other users and/or devices (i.e. with their virtual agents inside the Big Data set), and thus form complicated meta-structures with a certain level of stability and specific meta-behavior
  • It would be wrong to assume that only humans or higher animals demonstrate social behavior. Devices and math structures (virtual agents) with reflection (and auto-reflection) demonstrate it as well (at least we have agreed to consider this social behavior). Example: collaborative antipersonnel mines
  • By no means all Big Data collections and services are social in nature
  51. Predictive / inductive analytics
  • A set of mathematical techniques aimed at obtaining non-obvious knowledge from Big Data sets
  • Includes:
    • Mathematical statistics
    • Game theory
    • Multi-agent models (physical, economic, econophysical, social…)
    • Graph theory
    • Machine learning
    • Deep neural networks
    • Linear and nonlinear programming
    • …and more
  52. Consolidation
  • A service (service provider) collects all, or a substantial part, of the existing relevant data representing/covering a specific sort of information/subject matter. Examples: all suppliers of airplane spares, all citizens with medical or armed-forces experience, etc.
  53. Aggregation
  • A user, or a service on behalf of a special sort of (probably privileged) user, performs an operation aimed at obtaining some common (intensive or extensive) characteristics of the data set, or of a large subset of it, as a whole. Examples: the capacity of airplane spare suppliers, national rare-earth metal proven reserves, the number of citizens with medical or armed-forces experience by region, etc.
  • Does this operation produce really new knowledge?
  54. De-personalization / de-identification
  • Removal or hiding of personal or sensitive identifying information from data sets or from query responses (see the sketch below)
  • Should not block (not contradict) at least the most common (or all?) aggregation operations
  • Often required while producing Open Data
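  A minimal sketch of the field-level part of de-identification (Python; the record layout, field names and rules are invented for illustration, and real de-identification must also treat quasi-identifiers carefully, as the next slide shows):

```python
import hashlib

DIRECT_IDENTIFIERS = {"name", "ssn", "phone"}  # pseudonymize or remove

def depersonalize(record, salt="hypothetical-secret-salt"):
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            # Keyed hash: records stay linkable internally,
            # but are no longer directly identifying.
            out[field] = hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]
        elif field == "zip":
            out[field] = str(value)[:3] + "**"  # coarsen geography
        elif field == "birth_date":
            out[field] = str(value)[:4]         # keep year only
        else:
            out[field] = value                  # e.g. aggregatable payload
    return out

print(depersonalize({"name": "Ann Smith", "ssn": "123-45-6789",
                     "zip": "10025", "birth_date": "1980-06-01",
                     "sex": "F", "diagnosis": "flu"}))
```

  Note how the zip/birth-date generalization tries not to block aggregation: counts by region or by birth year still work on the cleaned records.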
  55. Re-personalization / re-identification. The use of sophisticated algorithms and additional (extended, external, open…) data sets to reconstruct personally identifiable information. Recent example: the use of several openly available data sets (purchasing, property, etc.) to reconstruct personal information from anonymized sets of medical records. Is it always illegal?
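  A toy sketch of the linkage-attack idea behind such re-identification (Python; the data and field names are invented, in the spirit of the classic ZIP-code/birth-date/sex results):

```python
# "Anonymized" medical records: direct identifiers removed,
# quasi-identifiers retained.
medical = [
    {"zip": "10025", "birth_date": "1980-06-01", "sex": "F", "diagnosis": "flu"},
    {"zip": "10027", "birth_date": "1975-02-11", "sex": "M", "diagnosis": "asthma"},
]
# An openly available data set that carries names (e.g. a voter roll).
public = [
    {"name": "Ann Smith", "zip": "10025", "birth_date": "1980-06-01", "sex": "F"},
    {"name": "Bob Jones", "zip": "10027", "birth_date": "1975-02-11", "sex": "M"},
]

QUASI = ("zip", "birth_date", "sex")

def reidentify(medical, public):
    index = {tuple(p[k] for k in QUASI): p["name"] for p in public}
    for m in medical:
        key = tuple(m[k] for k in QUASI)
        if key in index:  # a unique quasi-identifier combination leaks identity
            yield index[key], m["diagnosis"]

print(list(reidentify(medical, public)))
# [('Ann Smith', 'flu'), ('Bob Jones', 'asthma')]
```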
  56. Data theft / leaking
  • There are many scenarios of data theft to take into consideration while protecting an online service or an offline business IT system.
  • But one is Big Data specific. Big Data processing assumes the consolidation of disparate data sources into a unified (even if physically distributed) warehouse, and access to the whole set of data for the execution of some principally non-local operation (such as building a reverse index, etc.).
  • At least the early noSQL systems were designed in a way which could potentially provide access to all data from one specific (internal and privileged) account, thus creating the threat of stealing a large amount of data at a time.
  57. Digital forensics
  • Forensics: the use of science and technology (digital in our case) to investigate and establish facts in criminal or civil courts.
  • In many cases it utilizes capabilities not provided to ordinary users.
  • How should it be regulated?
  58. V. Math and (some) Legal
  59. Math behind processing
  • The effectiveness of Map/Reduce relies largely on the commutativity/associativity of its operations at each step. This allows us to abstract from the details of the operations happening at one compute node with one chunk of data. Instead, most operations can be described in terms of “natural transformations” or “mappings” (surprise!), thus making category theory a natural language to talk in. (Just for illustration purposes; see the demonstration below.)
  • We can potentially benefit from using languages such as Haskell (OK, F# also, and maybe Python…)
  • And we can search for practical problems already described in terms of category theory
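  A small demonstration of why this matters: an associative and commutative reduce gives the same answer however the framework splits the data into chunks, while a non-associative one does not (Python, illustrative only):

```python
from functools import reduce

data = [5.0, 3.0, 2.0, 8.0]

def chunked_reduce(op, chunks):
    # Reduce each chunk independently (as if on separate nodes),
    # then reduce the partial results.
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)

add = lambda a, b: a + b  # associative and commutative: safe to distribute
sub = lambda a, b: a - b  # neither: the chunking changes the answer

print(chunked_reduce(add, [data[:2], data[2:]]),
      chunked_reduce(add, [data[:3], data[3:]]))  # 18.0 18.0
print(chunked_reduce(sub, [data[:2], data[2:]]),
      chunked_reduce(sub, [data[:3], data[3:]]))  # 8.0 -8.0
```

  In categorical terms, the safe case is a reduction over a commutative monoid, which is exactly what lets Map/Reduce abstract from the placement of data chunks.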
  60. Categories at a glance: http://www.cs.indiana.edu/cmcs/categories.pdf
  61. Rosetta stone (a new one)
  • Category theory is nearly universal. It helps to streamline and unify concepts in various fields, from string theory to process management, as proclaimed in a well-known article.
  Is something missing here? It seems so: mathematical linguistics and its applications to search and information retrieval. But what about ethics? We'll see soon.
  62. Digital behaviorism vs. the charm of simplicity, or bag of words vs. universal grammar. Peter Norvig, Director of Research at Google, vs. Noam Chomsky, Professor (Emeritus) at MIT, “father of modern linguistics”. A somewhat oversimplified account of the discussion after an MIT anniversary session: “Chomsky wishes for an elegant theory of intelligence and language that looks past human fallibility to try to see simple structure underneath. Norvig, meanwhile, represents the new philosophy: truth by statistics, and simplicity be damned.” (Kevin Gold for tor.com) What is important here: Noam Chomsky really did develop an elegant and practically useful universal grammar theory, which declares that all languages are essentially similar and can be generated from a sort of symmetry structure with a relatively small number of parameters. This leads to the conclusion that the ability to learn grammar is “hard-wired into the brain”. Joachim Lambek proposed a type-enriched approach to grammar structuring which finally evolved into categorial grammar models.
  63. Evolution of linguistics (and cognitive science as well…), two parallel tracks:
  • Formal track: early grammars → formal grammars → universal grammar → the minimalist program → categorical approaches
  • Empiric track: early vocabularies → thesauruses and ontologies → human as a “bag of senses” → the statistical bag-of-words approach → deep search models
  • Both converge on “quantum” grammar
  64. “Quantum” linguistics
  65. Merging the worlds
  1. The meaning of a word lies in its context
  2. A statistical model of semantics can, by and large, be reduced to clustering in an appropriate vector space of n-gram probabilities
  3. A sentence's formal structure determines some specific part of its meaning
  4. A text is considered as an algebraic structure
  Now let's merge these mutually complementary approaches (through ⊗ composition) and use this new “space” to represent sentences with both semantics and formal structure (a toy sketch of the statistical half follows)
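  As a toy sketch of the statistical half of this picture, here are words as co-occurrence vectors, with similarity of meaning approximated by cosine similarity (Python; the vocabulary and counts are invented; the grammatical half would then compose such vectors along the sentence's parse structure):

```python
import math

# Invented co-occurrence counts: rows are target words, columns are
# context words (say: "food", "leash", "keyboard").
vectors = {
    "dog":    [8.0, 9.0, 0.0],
    "cat":    [9.0, 2.0, 1.0],
    "laptop": [0.0, 0.0, 9.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

print(cosine(vectors["dog"], vectors["cat"]))     # ~0.81: similar contexts
print(cosine(vectors["dog"], vectors["laptop"]))  # 0.0: disjoint contexts
```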
  66. When and why do we need such models? Shallow search, deep search, translation; and the “translation” of regulation from natural language into an ERP-readable, XBRL-like form, using:
  • the formal structure of sentences,
  • ontologies for terms with special meaning,
  • “bags of words” for the rest of the vocabulary and to validate term usage
  67. What about pieces of data
  • Generally <Key, Value>. But not so simple: Key and Value can be nested tuples, primitive values can be large blobs of text or other media, and pieces of data can actually contain code or assume the execution of some code.
  • Traditional large-scale distributed file systems targeting Big Data applications (like HDFS) initially ignored the nature of individual information objects, both in terms of placement and of access rights (i.e. security). Their core metaphor was enormous flat files. This is a really optimal approach for social networks, news search, and transactions.
  • It seems that more recent implementations (such as those for Storm, Caffeine, and Singularity) are more granular and, in some cases, respect individual object access requirements.
  • In an ideal world, all information objects would “live” in an infinite flat storage, interacting with each other and with external actors according to their contracts. A challenging goal in a multi-petabyte world.
  68. What are the problems with “Upstream”? It is huge. Not only are there a lot of business and state systems that supply data about almost everything to some warehouses, but there are a lot of people voluntarily supplying their data to social networks. More than 1B Facebook users. Little chance of changing the pattern without disruption. (http://themainstreetanalyst.com/2013/01/18/most-popular-social-networking-sites-by-country-infographic/)
  69. What are the problems with “Upstream”? Sharing personal data… by people, voluntarily: a feature, not a bug. People supply an enormous amount of data allowing others to identify and research them, their relatives, businesses, state institutions, etc. Some of them ignore or are not aware of the risks; most value the benefits of global socialization more highly than the damage caused by some loss of privacy. I.e., they “pay” with their personal data to receive social services. This is the reality in which legislators, lawyers, security experts and IT pros have to live. Rather than enforcing restrictive supply regulation, legislators could enforce responsible usage. Not a simple approach either, of course.
  70. What are the problems with “Upstream”? Sharing personal data… by devices: needs moral justification. State monitoring systems are great instruments for improving national security, and there are few if any objections against their implementation. Should people always be explicitly notified when they interact with them? Officials often miss this point when defending such systems. Just an example: “it was illegal to track a person without permission from the authorities, but … there was no law against tracking the property of a company, such as a SIM card.” This seems to be about “crime scene investigation”, which does not require explicit permission from potentially affected citizens. How far should we expand its scope: the local crime scene? The city? The state? How long after the case? Before the case?
  71. (Personal) data as a new currency (and tax). People pay (invest) their data to obtain benefits.
  To (commercial) social services:
  • If they consider them valuable
  • If they trust the providers, considering them reputable and responsible
  To state security, tax and other systems:
  • If they consider them effective
  • If they trust the administration, considering it legitimate and responsible
  Lack of trust in this area can cause economic slowdown and social frustration.
  72. Supply and identity. It has frequently been mentioned that the approach to identity in the digital world can evolve. People can either continue with their traditional approach to identity (which in the real world tends to be not only unique, but also single and “integral” for each subject), or switch to multiple partial, complementary, or even contradictory identities. Evidently, both social service providers and state institutions prefer the first approach. But the process is far from settled, and may turn to the second option in some geographies or social groups if over-pressed by restrictive regulation in a situation of lack of mutual trust. Also, we shall soon face more and more complicated strategies to misinform supply mechanisms, developed by criminals etc. Multiple SIM cards are only the beginning. What math should be used to reconstruct identity in the case of untrusted records in a Big Data set?
  73. What are the issues with “Downstream”? Generally speaking, there are two main problems (OK, there are more, but they are out of our scope):
  • Who is allowed to retrieve results from Big Data?
  • What is the quality of the results (are they complete and accurate)?
  74. Who is allowed to retrieve results: responsible usage. If you are an old-school IT security expert, your answer, most probably, would be: “nobody, never, nothing.” At least by default. In a few cases (a highly bureaucratized, hierarchical, small-to-medium organization) this works. But hardly in the case of 1B+ users (and they are different: individuals of different ages, provider employees, businesses, state officials, devices, etc.). The first thing we would do is implement elements of the responsible approach to information that Tim Berners-Lee and his colleagues recommend: just explicitly inform the potential user about usage regulation and potential liability. This is easier to say than to accomplish, as such an approach requires a mechanism for associating regulatory norms with pieces of information. See the slide “When and why do we need such models” above and, potentially, ontologies of usage patterns.
  75. Who is allowed to retrieve results: obligations and deontic logic. Nevertheless, whether society adopts a strict or a lightweight approach to access management, there will be areas of strict control, due to national security, children's rights protection, etc. We should be able to describe obligations associated with permissions: some must precede the action, some must be accomplished in its course or within a pre-defined timeframe after the action. Deontic logic, while not free of paradoxes, is a good background for this. “Deontic logic is that branch of symbolic logic that has been the most concerned with the contribution that the following notions make to what follows from what” (http://plato.stanford.edu/entries/logic-deontic/): permissible (permitted), impermissible (forbidden), obligatory (duty, required), omissible (non-obligatory), optional, supererogatory (beyond the call of duty), indifferent/significant, the least one can do, must, ought, better than/best/good/bad, claim/liberty/power/immunity. Since the early '80s, deontic logic has been used to formalize legal reasoning and legal corpora, access rights, database protection, as well as moral choice in IT (see J.-J. Ch. Meyer and Roel J. Wieringa. Applications of deontic logic in computer science: A concise overview. In Proceedings of the 1st International Workshop on Deontic Logic in Computer Science (DEON 1991), Amsterdam, The Netherlands, December 11-13, 1991, pp. 15-43). A sketch of obligation-aware access control follows.
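  A minimal sketch of obligation-aware access control in this spirit (Python; the policy structure and all names are invented for illustration; real systems would express this in P-RBAC or XACML obligations, mentioned earlier):

```python
from dataclasses import dataclass, field

@dataclass
class Permission:
    action: str
    pre_obligations: list = field(default_factory=list)   # must hold before
    post_obligations: list = field(default_factory=list)  # due after, with deadline

policy = {
    ("physician", "read_medical_record"): Permission(
        action="read_medical_record",
        pre_obligations=["verify_treatment_relationship"],
        post_obligations=[("log_access_for_audit", "within 24h")],
    ),
}

def authorize(role, action, fulfilled):
    perm = policy.get((role, action))
    if perm is None:
        return False, []                 # impermissible by default
    missing = [o for o in perm.pre_obligations if o not in fulfilled]
    if missing:
        return False, missing            # permitted only once obligations are met
    return True, perm.post_obligations   # permitted; obligations now scheduled

print(authorize("physician", "read_medical_record", set()))
print(authorize("physician", "read_medical_record", {"verify_treatment_relationship"}))
```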
  76. Deontic logic at a glance. (John Mikhail. Universal Moral Grammar: Theory, Evidence, and the Future. Trends in Cognitive Sciences, April 2007; Georgetown Public Law Research Paper No. 954398.) We shall discuss Mikhail's work in a bit more detail soon.
  77. Who is allowed to retrieve results: obligations and ethical calculus. Consider the following sentences:
  1. A informs B
  2. A tells B that p
  3. A lets B know that p
  4. A shares his knowledge that p with B
  5. A informs B about p
  6. A sends a message to the effect that p to B
  7. A's communications to B indicate that p
  The general form of (1)-(7) can be rendered as: agent A, in informational context C, sees to it that agent B believes that p; or: A informs B that X. Moral or legal constraints in information contexts may then be expressed in general terms as follows: it is (not) obligatory, or (not) permitted, for A to see to it that B knows that p.
  • Intellectual property (A): if John has an IP right in a particular piece of information X, then Peter ought to have permission from John to acquire, process or disseminate X.
  • Privacy and data protection (B): if information X is about John and Peter does not have X, then Peter is not permitted to acquire X without John's consent. If he does have X, then he is not permitted to process or disseminate it without John's consent.
  • Equal access (C): if A is informed about X, then all ought to be informed about X.
  • Responsibility and information (D): if John has an information responsibility regarding X, then John has an obligation to see to it that specific others have access to information X.
  (Jeroen van den Hoven and Gert-Jan Lokhorst. Deontic Logic and Computer-Supported Computer Ethics. 2002)
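  As a tiny illustration, rule (B) can be rendered as an executable check (Python; the record structure and names are invented):

```python
def may_acquire(actor, info, consents):
    # Rule (B): if information X is about John, then Peter is not
    # permitted to acquire X without John's consent.
    subject = info["about"]
    return actor == subject or (subject, actor) in consents

consents = {("john", "mary")}  # john consented to mary acquiring his data
info = {"about": "john", "payload": "location history"}

print(may_acquire("peter", info, consents))  # False: no consent from john
print(may_acquire("mary", info, consents))   # True: consent granted
print(may_acquire("john", info, consents))   # True: the data is about himself
```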
  78. What is the quality of results? False positives in Big Data: Bonferroni's principle. While preparing a report on Big Data one may face several problems:
  • Not all appropriate <key, value> pairs are identified
  • Some wrong pairs are selected
  • And, most interestingly, some pairs can satisfy the search criteria not due to their nature, but simply due to the statistically large size of the data sample
  “Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for. This observation is the informal statement of Bonferroni's principle. In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time, Bonferroni's principle says that we may only detect terrorists by looking for events that are so rare that they are unlikely to occur in random data.” (Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman. Mining of Massive Datasets, CUP, 2010, 2012)
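  Here is that calculation carried out in the spirit of the book's “evil-doers” example (Python; the parameters are illustrative):

```python
from math import comb

people = 10**9        # population under surveillance
days = 1000           # observation period
p_hotel_day = 0.01    # chance a given person is at some hotel on a given day
hotels = 10**5        # number of hotels

# Probability that two specific people are at the SAME hotel on a given day:
p_same = p_hotel_day * p_hotel_day / hotels  # 1e-9

# Expected number of (pair of people, pair of days) coincidences
# if everyone behaves completely at random:
expected = comb(people, 2) * comb(days, 2) * p_same**2
print(f"{expected:,.0f}")  # ~250,000 purely accidental "suspicious" pairs
```

  If the analyst expects only a handful of real conspiracies, essentially every flagged pair is a statistical artifact, exactly as Bonferroni's principle warns.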
  79. What are the issues with social interaction? From access control to social interactions. People (and already some other information agents) establish social interactions: they fall into cooperation and conflicts, make moral choices and moral evaluations. If we want to build services processing data supplied by billions of people, and expect the services to behave in a “natural” way (either respond accordingly or make grounded predictions), we need to think about the implementation of ethical norms. The area in which to apply ethical concepts in Big Data and social network applications is wide, from obligation-enhanced access control to the ethical evaluation of autonomous decisions. Why do we need ethical evaluation? It is almost impossible to predict and list all micro-scenarios of service usage: should a service provide agent A with info regarding agent B if it will impact the decisions A can make regarding agent C? What if A is a physician? What if A is (or is not) his patient? What in case of emergency? Do not forget that modern “autonomous artificial intellectual agents” normally invoke some predictive analytics over statistically large sets of data (Big Data) to make decisions. And these decisions should be ethical. At least, understandable to humans.
  80. What are the issues with social interaction? Why ethical behavior is usually considered understandable. Let's turn back to John Mikhail's works (see Mikhail, John. Universal Moral Grammar: Theory, Evidence, and the Future. Trends in Cognitive Sciences, April 2007; Georgetown Public Law Research Paper No. 954398; and earlier). A possible, and very attractive, answer is: “Because ethics, or at least some main ethical categories, are built into our brains.” So humans are “ethical by design”, as computers should be. And nothing in common with creationism here… Mikhail's Universal Moral Grammar follows Chomsky's universal grammar approach. Five main questions of universal moral grammar:
  • What constitutes moral knowledge?
  • How is moral knowledge acquired?
  • How is moral knowledge put to use?
  • How is moral knowledge physically realized in the brain?
  • How did moral knowledge evolve in the species?
  81. What are the issues with social interaction? A bit more about Universal Moral Grammar. UMG proposes a set of rules to process the transformation of a questioned situation: from the initial description, through structural analysis, temporal decomposition, and deontic analysis, to a final decision. UMG takes into account humans' different evaluation of direct impact vs. side effects, etc.
  82. What are the issues with analytics? Which analytics do we need? Again, there is a sort of competition between statistical and structural approaches. But in the case of analytics it is evident that we need both (and, in some cases, multi-agent simulation as well). A few methods/disciplines to mention:
  • Mathematical statistics and statistical hypothesis testing
  • Linear and nonlinear programming, optimization
  • Neural networks (yes, finally we have workable deep neural networks!), multidimensional approximations
  • Game theory
  • Graph theory
  • Coding theory
  • Mathematical logic, in various flavors
  • Formal linguistics
  • Statistical econophysics…
  83. What are the issues with analytics? Why are they all different? Let's limit ourselves to one important problem: how to describe and analyze situations of cooperation and conflict and the different roles of actors, and how to explain that different actors behave differently in similar situations while remaining, at the same time, ethical (not violating their own moral rules). This is the question from Mikhail again: moral diversity is far from being only geo-cultural; it is mental and group-social as well.
  84. What are the issues with analytics? In search of a language and calculus for compromise and conflict. There are a number of approaches aiming to describe these interactions, classify their participants, and predict their steps. Among others: the (auto)reflexive approach by Vladimir Lefebvre, developed in the '70s and '80s, first in the USSR and then in the USA. Lefebvre proposed a (set of) simple ethical calculi based on Boolean logic, and found that there are two self-consistent ethical systems (Lefebvre, V.A. Algebra of Conscience. D. Reidel, Holland, 1982):
  First ethical system:
  • The end DOES NOT justify the means
  • If there is a conflict between means and ends, one SHOULD be concerned
  Second ethical system:
  • The end DOES justify the means
  • If there is a conflict between means and ends, one SHOULD NOT be concerned
  A simple taxonomy for social interactions (ibid.):
  First ethical system:
  • A “saint” is non-aggressive, tends toward compromise, and has low self-evaluation
  • A “hero” is non-aggressive, tends toward compromise, and has high self-evaluation
  • A “philistine” is aggressive, tends toward conflict, and has low self-evaluation
  • A “dissembler” is aggressive, tends toward conflict, and has high self-evaluation
  Second ethical system:
  • A “saint” is aggressive, tends toward conflict, and has low self-evaluation
  • A “hero” is aggressive, tends toward conflict, and has high self-evaluation
  • A “philistine” is non-aggressive, tends toward compromise, and has low self-evaluation
  • A “dissembler” is non-aggressive, tends toward compromise, and has high self-evaluation
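  The taxonomy above is simple enough to encode directly as a lookup (Python; this is just the slide's table, not Lefebvre's full Boolean calculus; note that the original slide's second-system “dissembler” row read “low self-evaluation”, which would duplicate the “philistine” row, so “high” is assumed here by symmetry with the first system):

```python
# Lefebvre's agent types as a function of the ethical system,
# attitude to conflict, and self-evaluation (per the table above).
TAXONOMY = {
    # (system, aggressive, high_self_evaluation): agent type
    (1, False, False): "saint",
    (1, False, True):  "hero",
    (1, True,  False): "philistine",
    (1, True,  True):  "dissembler",
    (2, True,  False): "saint",
    (2, True,  True):  "hero",
    (2, False, False): "philistine",
    (2, False, True):  "dissembler",
}

def classify(system, aggressive, high_self_evaluation):
    return TAXONOMY[(system, aggressive, high_self_evaluation)]

# The same observable behavior maps to opposite moral types
# in the two self-consistent ethical systems:
print(classify(1, aggressive=True, high_self_evaluation=True))  # dissembler
print(classify(2, aggressive=True, high_self_evaluation=True))  # hero
```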
  85. What are the issues with analytics? Algebra of conscience: again, notation from the theory of categories [diagram]
  86. Big Data specific theft / leaking: anatomy of (some of) the grand leaks. There are many scenarios of data theft to take into consideration while protecting an online service or an offline business IT system. But at least one is specific to Big Data. When people (businessmen and politicians) first became acquainted with the technology, they attempted to use it for predictive analysis of all non-quantitative sources at once. The idea was: “Let's collect all evidence, whether reliable or not, and attempt to classify it, using the same corpus for verification of each statement.” Suppliers were instructed to input everything into such systems, from official documents to rumors, and let the computer separate the wheat from the chaff. For the first time ever, there appeared huge warehouses of text documents: in one place, with unified access from one (internal and privileged, but single) system account. It was only a matter of time before massive leaks happened. Any protection mechanisms? Physical security; access to different file-system chunks or information objects from different accounts; sophisticated access-rights management; and, mostly in the near future, computation on encrypted data (homomorphic encryption, sketched below).
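  A toy demonstration of the homomorphic idea, using textbook RSA's multiplicative property (Python; tiny, utterly insecure parameters, purely illustrative; production systems would look at schemes like Paillier or fully homomorphic encryption):

```python
# Textbook RSA with toy parameters (INSECURE; illustration only).
p, q, e = 61, 53, 17
n = p * q                 # modulus, 3233
phi = (p - 1) * (q - 1)   # 3120
d = pow(e, -1, phi)       # private exponent (Python 3.8+)

enc = lambda m: pow(m, e, n)
dec = lambda c: pow(c, d, n)

a, b = 7, 6
# A server can multiply the ciphertexts WITHOUT ever seeing a or b...
c_product = (enc(a) * enc(b)) % n
# ...and the data owner decrypts the product of the plaintexts.
print(dec(c_product))     # 42 == a * b
```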
  87. The issues with de-/re-personalization: differential privacy. Big Data is a hugely valuable asset. But how to use it without violating privacy? Say, how to arrange some privacy-preserving sharing of Big Data between state and business? Just one example: a pharmacy chain requires statistical/demographic/etc. data to expand and tune its business, but is prohibited from accessing personal medical records. The simplest answer is “open data”: statically privacy- (and other secret-) cleaned sets of data. A good start, but far from enough: you are now mostly limited to predefined reports, and most of Big Data's power is lost. The real question is how to ensure privacy dynamically, over the complete set of data, i.e. how to assure suppliers that participation in Big Data sets is not a privacy risk. A good answer is differential privacy. There are more sophisticated definitions/requirements as well (Ilya Mironov, Omkant Pandey, Omer Reingold, and Salil Vadhan. Computational Differential Privacy. In Advances in Cryptology (CRYPTO 2009), Springer, 2009).
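  A minimal sketch of the standard Laplace mechanism for a counting query, the basic construction behind the definition (Python; the epsilon value and record set are illustrative):

```python
import random

def dp_count(records, predicate, epsilon):
    # A count has sensitivity 1 (adding or removing one person changes
    # it by at most 1), so Laplace noise of scale 1/epsilon yields
    # epsilon-differential privacy (the Laplace mechanism).
    true_count = sum(1 for r in records if predicate(r))
    # Laplace(0, 1/epsilon) as the difference of two exponentials.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# E.g., how many patients in a (hypothetical) record set have the flu?
records = [{"diagnosis": "flu"}] * 130 + [{"diagnosis": "asthma"}] * 70
print(dp_count(records, lambda r: r["diagnosis"] == "flu", epsilon=0.1))
# ~130 plus noise of scale 10: useful statistics, while no single
# patient's participation is revealed.
```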
  88. More to discuss
  • Contract theory and ethics
  • Econophysics and ethics
  • Differential privacy and the main laws of Information Ethics
  • Applications
  89. Questions? Vlad Shershulsky, vladsh@microsoft.com
