R A Longhorn Presentation at Taiwan Open Data Forum, Taipei, 9 July 2014


“Big Data Meets Open Data: Challenges and Issues”, a presentation by Roger Longhorn, Operations & Communications Manager, GSDI Association, delivered at the Taiwan Open Data Forum in Taipei on 9 July 2014.

Published in: Government & Nonprofit

  • Open Definition - Version 1.1


    The term knowledge is taken to include:
    Content such as music, films, books
    Data be it scientific, historical, geographic or otherwise
    Government and other administrative information
    Software is excluded despite its obvious centrality because it is already adequately addressed by previous work.

    The term work will be used to denote the item or piece of knowledge which is being transferred.

    The term package may also be used to denote a collection of works. Of course such a package may be considered a work in itself.

    The term license refers to the legal license under which the work is made available. Where no license has been made this should be interpreted as referring to the resulting default legal conditions under which the work is available (for example copyright).

    The Definition

    A work is open if its manner of distribution satisfies the following conditions:

    1. Access
    The work shall be available as a whole and at no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form.
    Comment: This can be summarized as ‘social’ openness – not only are you allowed to get the work but you can get it. ‘As a whole’ prevents the limitation of access by indirect means, for example by only allowing access to a few items of a database at a time (material should be available in bulk as necessary). Convenient and modifiable means that material should be machine readable (rather than, for example, just human readable).

    2. Redistribution
    The license shall not restrict any party from selling or giving away the work either on its own or as part of a package made from works from many different sources. The license shall not require a royalty or other fee for such sale or distribution.

    3. Reuse
    The license must allow for modifications and derivative works and must allow them to be distributed under the terms of the original work.
    Comment: Note that this clause does not prevent the use of ‘viral’ or share-alike licenses that require redistribution of modifications under the same terms as the original.

    4. Absence of Technological Restriction
    The work must be provided in such a form that there are no technological obstacles to the performance of the above activities. This can be achieved by the provision of the work in an open data format, i.e. one whose specification is publicly and freely available and which places no restrictions monetary or otherwise upon its use.

    5. Attribution
    The license may require as a condition for redistribution and re-use the attribution of the contributors and creators to the work. If this condition is imposed it must not be onerous. For example if attribution is required a list of those requiring attribution should accompany the work.

    6. Integrity
    The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work.

    7. No Discrimination Against Persons or Groups
    The license must not discriminate against any person or group of persons.
    Comment: In order to get the maximum benefit from the process, the maximum diversity of persons and groups should be equally eligible to contribute to open knowledge. Therefore we forbid any open-knowledge license from locking anybody out of the process.
    Comment: this is taken directly from item 5 of the OSD.

    8. No Discrimination Against Fields of Endeavor
    The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for genetic research.
    Comment: The major intention of this clause is to prohibit license traps that prevent open material from being used commercially. We want commercial users to join our community, not feel excluded from it.
    Comment: this is taken directly from item 6 of the OSD.

    9. Distribution of License
    The rights attached to the work must apply to all to whom it is redistributed without the need for execution of an additional license by those parties.
    Comment: This clause is intended to forbid closing up knowledge by indirect means such as requiring a non-disclosure agreement.
    Comment: this is taken directly from item 7 of the OSD.

    10. License Must Not Be Specific to a Package
    The rights attached to the work must not depend on the work being part of a particular package. If the work is extracted from that package and used or distributed within the terms of the work’s license, all parties to whom the work is redistributed should have the same rights as those that are granted in conjunction with the original package.
    Comment: this is taken directly from item 8 of the OSD.

    11. License Must Not Restrict the Distribution of Other Works
    The license must not place restrictions on other works that are distributed along with the licensed work. For example, the license must not insist that all other works distributed on the same medium are open.
    Comment: Distributors of open knowledge have the right to make their own choices. Note that ‘share-alike’ licenses are conformant since those provisions only apply if the whole forms a single work.
    Comment: this is taken directly from item 9 of the OSD.
    - See more at: http://opendefinition.org/od/

  • G8 Open Data Charter


    1. The world is witnessing the growth of a global movement facilitated by technology and social media and fuelled by information – one that contains enormous potential to create more accountable, efficient, responsive, and effective governments and businesses, and to spur economic growth.

    Open data sit at the heart of this global movement.

    2. Access to data allows individuals and organisations to develop new insights and innovations that can improve the lives of others and help to improve the flow of information within and between countries. While governments and businesses collect a wide range of data, they do not always share these data in ways that are easily discoverable, useable, or understandable by the public.

    This is a missed opportunity.

    3. Today, people expect to be able to access information and services electronically when and how they want. Increasingly, this is true of government data as well. We have arrived at a tipping point, heralding a new era in which people can use open data to generate insights, ideas, and services to create a better world for all.

    4. Open data can increase transparency about what government and business are doing. Open data also increase awareness about how countries’ natural resources are used, how extractives revenues are spent, and how land is transacted and managed. All of which promotes accountability and good governance, enhances public debate, and helps to combat corruption. Transparent data on G8 development assistance are also essential for accountability.

    5. Providing access to government data can empower individuals, the media, civil society, and business to fuel better outcomes in public services such as health, education, public safety, environmental protection, and governance. Open data can do this by:
    - showing how and where public money is spent, providing strong incentives for that money to be used most effectively;
    - enabling people to make better informed choices about the services they receive and the standards they should expect.

    6. Freely-available government data can be used in innovative ways to create useful tools and products that help people navigate modern life more easily. Used in this way, open data are a catalyst for innovation in the private sector, supporting the creation of new markets, businesses, and jobs. Beyond government, these benefits can multiply as more businesses adopt open data practices modelled by government and share their own data with the public.

    7. We, the G8, agree that open data are an untapped resource with huge potential to encourage the building of stronger, more interconnected societies that better meet the needs of our citizens and allow innovation and prosperity to flourish.

    8. We therefore agree to follow a set of principles that will be the foundation for access to, and the release and re-use of, data made available by G8 governments. They are:

    9. While working within our national political and legal frameworks, we will implement these principles in accordance with the technical best practices and timeframes set out in our national action plans. G8 members will, by the end of this year, develop action plans, with a view to implementation of the Charter and technical annex by the end of 2015 at the latest. We will review progress at our next meeting in 2014.

    10. We also recognise the benefits of open data can and should be enjoyed by citizens of all nations. In the spirit of openness we offer this Open Data Charter for consideration by other countries, multinational organisations and initiatives.

    Principle 1: Open Data by Default

    11. We recognise that free access to, and subsequent re-use of, open data are of significant value to society and the economy.
    12. We agree to orient our governments towards open data by default.
    13. We recognise that the term government data is meant in the widest sense possible. This could apply to data owned by national, federal, local, or international government bodies, or by the wider public sector.
    14. We recognise that there is national and international legislation, in particular pertaining to intellectual property, personally-identifiable and sensitive information, which must be observed.
    15. We will: establish an expectation that all government data be published openly by default, as outlined in this Charter, while recognising that there are legitimate reasons why some data cannot be released.

    Principle 2: Quality and Quantity

    16. We recognise that governments and the public sector hold vast amounts of information that may be of interest to citizens.
    17. We also recognise that it may take time to prepare high-quality data, and the importance of consulting with each other and with national, and wider, open data users to identify which data to prioritise for release or improvement.
    18. We will:
    - release high-quality open data that are timely, comprehensive, and accurate. To the extent possible, data will be in their original, unmodified form and at the finest level of granularity available;
    - ensure that information in the data is written in plain, clear language, so that it can be understood by all, though this Charter does not require translation into other languages;
    - make sure that data are fully described, so that consumers have sufficient information to understand their strengths, weaknesses, analytical limitations, and security requirements, as well as how to process the data; and
    - release data as early as possible, allow users to provide feedback, and then continue to make revisions to ensure the highest standards of open data quality are met.

    Principle 3: Usable by All

    19. We agree to release data in a way that helps all people to obtain and re-use it.
    20. We recognise that open data should be available free of charge in order to encourage their most widespread use.
    21. We agree that when open data are released, it should be done without bureaucratic or administrative barriers, such as registration requirements, which can deter people from accessing the data.
    22. We will:
    - release data in open formats wherever possible, ensuring that the data are available to the widest range of users for the widest range of purposes; and
    - release as much data as possible, and where it is not possible to offer free access at present, promote the benefits and encourage the allowance of free access to data. In many cases this will include providing data in multiple formats, so that they can be processed by computers and understood by people.

    Principle 4: Releasing Data for Improved Governance

    23. We recognise that the release of open data strengthens our democratic institutions and encourages better policy-making to meet the needs of our citizens. This is true not only in our own countries but across the world.
    24. We also recognise that interest in open data is growing in other multilateral organisations and initiatives.
    25. We will:
    - share technical expertise and experience with each other and with other countries across the world so that everyone can reap the benefits of open data; and
    - be transparent about our own data collection, standards, and publishing processes, by documenting all of these related processes online.

    Principle 5: Releasing Data for Innovation

    26. Recognising the importance of diversity in stimulating creativity and innovation, we agree that the more people and organisations that use our data, the greater the social and economic benefits that will be generated. This is true for both commercial and non-commercial uses.
    27. We will:
    - work to increase open data literacy and encourage people, such as developers of applications and civil society organisations that work in the field of open data promotion, to unlock the value of open data;
    - empower a future generation of data innovators by providing data in machine-readable formats.
  • Data challenges
    Volume: the main challenge is how to deal with the size of big data.

    Variety: combining multiple data sets: the challenge is how to handle multiplicity of types, sources and formats.

    Velocity: one of the key challenges is how to react to the flood of information in the time required by the application.

    Veracity: data quality, data availability: How can we cope with uncertainty, imprecision, missing values, misstatements or untruths? How good is the data? How broad is the coverage? How fine is the sampling resolution? How timely are the readings? How well understood are the sampling biases? Is there data available, at all?

    Data discovery: this is a huge challenge: how to find high-quality data from the vast collections of data that are out there on the Web?

    Quality and relevance: the challenge is determining the quality of data sets and their relevance to particular issues (for example, does the data set make an underlying assumption that renders it biased or uninformative for a particular question?).

    Data comprehensiveness: are there areas without coverage? What are the implications?

    Personally identifiable information: Can we extract enough information to help people without extracting so much as to compromise their privacy?

    Data dogmatism: Analysis of big data can offer quite remarkable insights, but we must be wary of relying too much on the numbers. Domain experts and common sense must continue to play a role.

    Scalability: according to Shilpa Lawande, VP Engineering at analytics platform provider Vertica, “techniques like social graph analysis, for instance leveraging the influencers in a social network to create better user experience are hard problems to solve at scale. All of these problems combined create a perfect storm of challenges and opportunities to create faster, cheaper and better solutions for big data analytics than traditional approaches can solve.”
  • Process challenges

    A major challenge in this context is how to analyse the data. Shilpa Lawande from Vertica explained that “It can take significant exploration to find the right model for analysis, and the ability to iterate very quickly and ‘fail fast’ through many (possible throw away) models – at scale – is critical.”
    According to Laura Haas from IBM Research, process challenges in regard to deriving insights include:

    Capturing data

    Aligning data from different sources (e.g., resolving when two objects are the same)

    Transforming the data into a form suitable for analysis

    Modelling it, whether mathematically, or through some form of simulation
    Understanding the output, visualizing and sharing the results, considering how to display complex analytics on a mobile device.
  • Management challenges

    The main management challenges relate to data privacy, security, governance, and ethical issues. They include ensuring that data is used correctly (abiding by its intended uses and relevant laws), tracking how the data is used, transformed and derived, and managing its lifecycle.
  • SRIA Draft - http://www.bigdatavalue.eu/index.php/downloads/finish/3-big-data-value/14-big-data-value-strategic-research-and-innovation-agenda/0
  • R A Longhorn Presentation at Taiwan Open Data Forum, Taipei, 9 July 2014

    1. OPEN DATA meets BIG DATA: Issues and Challenges. Roger Longhorn, Operations & Communications Director, GSDI Association; Founder Member, IGS; ral@alum.mit.edu
    2. The Presentation • What is Open Data? • Open Data Challenges • What is Big Data? • Big Data Key Challenges • Research needs (09 July 2014, 2014 Open Data Forum, Taipei)
    3. What is Open Data? The Open Definition from the Open Knowledge Foundation: • sets out principles that define “openness” in relation to data and content • precisely defines “open” in the terms “open data” and “open content” • ensures interoperability (shared access) between different collections of open material. http://opendefinition.org/okd/ http://okfn.org/
    4. What is Open Data? “A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” http://opendefinition.org/od/ http://okfn.org/
    5. OKF’s Open Data Definition. The Open Knowledge Foundation’s definition covers: • Access • Redistribution • Reuse • Absence of Technological Restriction • Attribution • Integrity • No Discrimination Against Persons or Groups • Distribution of License • License Must Not Be Specific to a Package • License Must Not Restrict the Distribution of Other Works http://opendefinition.org/od/
    6. Open Data Census. Global Census Facts at 2014: • Number of countries = 70 • Number of datasets = 700 • Number of open datasets = 84 • Percentage open = 12%. From the Open Data Index, https://index.okfn.org/
    7. G8 Open Data Charter • Principle 1: Open Data by Default • Principle 2: Quality and Quantity • Principle 3: Usable by All • Principle 4: Releasing Data for Improved Governance • Principle 5: Releasing Data for Innovation
    8. Open Data Challenges: 1. What data should be made public? 2. How to make data publicly ‘open’? 3. How to efficiently implement and monitor Open Data policy? 4. How to judge the effectiveness of an Open Data policy?
    9. What Data Should Be Public? 1. Economic drivers • Recent studies reveal the value to economies of opening up public datasets for unrestricted use, including commercially. 2. Principles for governance of society • Reactive versus proactive release of government data? • Privacy concerns • Existing regulations
    10. Making Data Publicly ‘Open’: 1. Agreeing data (& service) standards • … and introducing them. 2. Setting appropriate policies • … and enacting them. 3. Promulgating regulations • … and enforcing them. Lessons learned from the EU’s PSI Re-use Directive(s) (2003 and 2013)
    11. Monitoring Policy: 1. Monitoring Open Data • Should you monitor Open Data policy? • Can you monitor Open Data policy? 2. Implementing policy • Voluntary v. mandatory • Regulations? • Handling infringements 3. Technology • For monitoring and reporting
    12. Judging Effectiveness: 1. How to judge the effectiveness of a government’s Open Data policy? 2. Defining ‘effectiveness’ • Benefits for government, society and businesses • Cost-Benefit Analysis – feasible? • Identifying tangible v. intangible benefits 3. What ‘indicators of success’ to use • Some will be financial • Many (intangibles) will be difficult to measure
    13. What Is Big Data? “2,500,000,000,000,000,000 Bytes (2.5 × 10^18) of data are created every day!” (2012) or “8,000,000,000,000,000,000 (7 exabytes) of new data were stored globally by enterprises in 2010”. Source: McKinsey Global Institute
    14. The Big Data Landscape
    15. The 5 Dimensions of Big Data
    16. Value of Big Data • In 2012, the world-wide Big Data market reached US$11.59 billion (exceeding previous forecasts). • For 2013, a growth rate of over 60% was predicted, leading to a global Big Data market value of US$18.1 billion. • For 2012-2017, a 31% Compound Annual Growth Rate (CAGR) was calculated. • The global Big Data market is predicted to exceed US$47 billion by 2017. Sources: Jeff Kelly et al: Big Data Vendor Revenue and Market Forecast 2012-2017 (2013); International Data Corporation (IDC): Worldwide Big Data Technology and Services 2012-2016 Forecast (2012)
    17. Value of Big Data • Big Data is “the next frontier for innovation, competition and productivity”. • The impact of Big Data provides huge potential for competition and growth for individual companies. • The right use of Big Data can increase productivity, innovation, and competitiveness for entire sectors and economies. Source: McKinsey Global Institute, Big Data: The next frontier for innovation, competition and productivity
    18. Big Data Challenges. Three Big Data Challenges • Data Challenges • Process Challenges • Management Challenges
    19. Data Challenges • Volume • Variety • Velocity • Veracity • Data discovery • Quality and relevance • Data comprehensiveness • Personally identifiable information • Data dogmatism • Scalability
    20. Process Challenges • Capturing data • Aligning data from different sources • Transforming the data into a form suitable for analysis • Modelling it, either mathematically or via simulation • Understanding the output – visualizing and sharing the results, – how to display complex analytics on a mobile device.
    21. Management Challenges • Skills development • Data privacy • Security & Governance • Tracking how the data is used, transformed and derived • Ethical issues – ensuring that data is used correctly – abiding by its intended uses and relevant laws • Managing its lifecycle
    22. Big Data meets Open Data • Identifying the ‘right’ Big Data to provide as Open Data, – Why is this data needed? – Who needs it? – How can it be processed and used? • Overcoming data access and connectivity challenges – Especially relating to interoperability issues for multiple Big Data datasets (including those collected in real time) – Especially if these are not all fully open or follow different Open Data policies;
    23. Big Data meets Open Data • Making best use of Big Data – Working across multiple functions (IT, engineering, finance, procurement) – Overcoming the fragmented ownership of Big Data (custodianship, IPR, licensing, etc.); • Resolving security concerns – Data protection (for data owners) – Privacy (for personal data) – Potential misuse (which can raise liability issues)
    24. The Research Agenda • European Big Data Value Strategic Research & Innovation Agenda – “The objective of the SRIA is to describe the main research challenges and needs for advancing Big Data Value in Europe in the next 5 to 10 years.” • USA - Big Data Research Initiative – “cross-agency plans and research efforts to extract knowledge and insights from large and complex collections of digital data.” • NSF to direct efforts to – develop new methods to derive knowledge from data; – construct new infrastructure to manage, curate and serve data to communities; and – forge new approaches for associated education and training.
    25. Thank You! Roger Longhorn, Operations & Communications Director, GSDI Association; Founder Member, IGS; ral@alum.mit.edu
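
The headline figures on slides 6 and 16 rest on simple arithmetic: a share of open datasets and a compound annual growth rate. The sketch below shows one way a reader might recompute them. The function names (`percent_open`, `implied_cagr`, `project`) are hypothetical and not from the presentation; the input numbers are the ones quoted on the slides. The slide figures are rounded analyst forecasts, so the recomputed values should be expected to agree only approximately.

```python
def percent_open(open_datasets, total_datasets):
    """Share of datasets that are open, as a percentage."""
    if total_datasets <= 0:
        raise ValueError("total_datasets must be positive")
    return 100.0 * open_datasets / total_datasets


def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value, annual_rate, years):
    """Value after compounding `annual_rate` for `years` years."""
    return start_value * (1.0 + annual_rate) ** years


if __name__ == "__main__":
    # Slide 6: 84 open datasets out of 700 surveyed.
    print(f"Open share: {percent_open(84, 700):.1f}%")

    # Slide 16: US$11.59 billion in 2012, forecast above US$47 billion by 2017.
    rate = implied_cagr(11.59, 47.0, 5)
    print(f"Implied CAGR 2012-2017: {rate:.1%}")

    # Slide 16 also quotes a 31% CAGR; compound it forward from the 2012 base.
    print(f"2017 value at 31% CAGR: US${project(11.59, 0.31, 5):.1f} billion")
```

Running the script prints the three recomputed values so they can be compared with the slide text.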