Edyra
- 1. ResUbic Research Seminar
ResUbic Research Lab Dresden
EDYRA Engineering of Do-it-Yourself Analytic
Rich Internet Applications
Wolfgang Lehner
Maik Thiele
Katrin Braunschweig
Julian Eberius
© Prof. Dr.-Ing. Wolfgang Lehner
- 2. >
MAD Skills
[Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, Caleb Welton:
MAD Skills: New Analysis Practices for Big Data. PVLDB 2009]
- 3. > Motivation (1)
In the days of Kings
and Priests
Computers and Data: Crown Jewels
Executives depend on computers
But cannot work with them directly
The DBA “Priesthood”
And their Acronymia: EDW, BI, OLAP
The architected Enterprise DWH
Rational behavior…for a bygone era
“There is no point in bringing data … into the
data warehouse environment without
integrating it.”
—Bill Inmon, Building the Data Warehouse,
2005
- 4. > Motivation (2)
New Realities
TB disks < $100
Everything is data
Rise of data-driven culture
Very publicly espoused by Google,
Wired, etc.
Sloan Digital Sky Survey, Terraserver, etc.
The quest for knowledge used to begin with grand theories.
Now it begins with massive amounts of data.
Welcome to the Petabyte Age.
- 5. > MAD Skills
Magnetic
"Attract data and practitioners"
Use all data sources, independent of their data quality
Agile
"Rapid iteration: ingest, analyze, productionalize"
Continuous evolution of the logical and physical structures
ELT (Extract, Load, Transform)
Deep
"Sophisticated analytics in Big Data"
Extended algorithmic run-time
Ad-hoc advanced analytics and statistics
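The ELT idea on this slide (load the raw data first, transform it inside the database afterwards) can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names, columns and sample rows are hypothetical and are not taken from the MAD Skills paper.

```python
import sqlite3

# Hypothetical raw records, loaded as-is: inconsistent case, one bad value.
RAW_ROWS = [
    ("dresden", "2010-01-05", "12.5"),
    ("Dresden", "2010-01-06", "n/a"),
    ("leipzig", "2010-01-05", "9.0"),
]

def run_elt(rows):
    conn = sqlite3.connect(":memory:")
    # Extract + Load: store everything as text, without cleaning it first.
    conn.execute("CREATE TABLE staging (city TEXT, day TEXT, value TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
    # Transform: clean and aggregate inside the database, after loading.
    conn.execute(
        """
        CREATE TABLE city_avg AS
        SELECT lower(city) AS city, avg(CAST(value AS REAL)) AS avg_value
        FROM staging
        WHERE value GLOB '[0-9]*'      -- skip rows whose value is not numeric
        GROUP BY lower(city)
        """
    )
    result = conn.execute(
        "SELECT city, avg_value FROM city_avg ORDER BY city"
    ).fetchall()
    conn.close()
    return result

result = run_elt(RAW_ROWS)
```

Because the raw rows stay in the staging table, the transformation can be changed and re-run later without loading the data again.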
- 6. > Open Data, Services and Mashups
Web of Data
E-Government 2.0, Initiative i2010
Europeana, World Digital Library
Public data catalogs
http://data.gov/
http://data.gov.uk/
Free to
Copy, distribute and transmit the data
Adapt the data
Exploit the data commercially, whether by sub-licensing it, combining it with other
data, or by including it in your own product
Web of Services
OpenSocial-API (Google, Yahoo!, MySpace, Xing)
Scientific Computations (http://www.wolframalpha.com)
Entity Detection (http://www.yooname.com)
Visualization (http://manyeyes.alphaworks.ibm.com/manyeyes)
Web of Mashups
ProgrammableWeb (http://www.programmableweb.com/)
- 7. > Principles of Open Data
Data shall be considered open if it is made public in a way that complies with
the principles below
Complete: All public data is made available. Public data is data that is not subject to valid privacy,
security or privilege limitations.
Primary: Data is as collected at the source, with the highest possible level of granularity, not in
aggregate or modified forms
Timely: Data is made available as quickly as necessary to preserve the value of the data.
Accessible: Data is available to the widest range of users for the widest range of purposes.
Machine processable: Data is reasonably structured to allow automated processing.
Non-discriminatory: Data is available to anyone, with no requirement of registration.
Non-proprietary: Data is available in a format over which no entity has exclusive control.
License-free: Data is not subject to any copyright, patent, trademark or trade secret regulation.
Reasonable privacy, security and privilege restrictions may be allowed.
Source: http://resource.org/8_principles.html
- 8. >
"Data belongs to the people": typical examples are genomes, data on
organisms, medical research, environmental science data
Public funds made the generation of the data possible in the first place, so
the data must also be publicly accessible (in practice, scientists usually
transfer the rights to the data they generate to private publishers when they
publish their results)
Facts cannot be subject to copyright
Research is advanced when scientific findings are freely accessible to all
researchers
- 10. > Gapminder (2)
Vision: making sense of the world by having fun with statistics!
Gapminder is a non-profit venture for development and provision of free software to
visualize human development trends
Gapminder will ultimately be integrated into Google: this is the first time global
datasets will be searchable over the Internet
Hans Rosling @ TED
TEDTalks: annual technology conference in California, USA
http://www.ted.com/tedtalks/
Hans Rosling is a professor of global health at the Karolinska Institute, data
visualization extraordinaire and the creator of the Gapminder tools
see http://www.youtube.com/watch?v=YpKbO6O3O3M
- 11. > Public.Resource.Org
Idea: Make government more transparent
Project funded: Public.Resource.Org is a non-profit organization focused on
enabling online access to public government documents in the United States.
We are providing $2 million to Public.Resource.Org to support the Law.Gov
initiative, which aims to make all primary legal materials in the United States
available to all.
Winner of Project 10^100
http://www.project10tothe100.com/intl/DE/index.html
- 12. > Microsoft’s Open Government Data Initiative
• The Open Government Data Initiative (OGDI) is a cloud-based
collection of software assets that enables publicly available government
data to be easily accessible. Using open standards and application
programming interfaces (API), developers and government agencies can
retrieve the data programmatically for use in new and innovative online
applications, or mash-ups that can help:
– Improve citizen services
– Enhance collaboration between
government agencies and private organizations
– Increase government transparency
• OGDI promotes the use of this data by capturing and publishing re-
usable software assets, patterns, and practices. The data repository
already holds over 60 different government datasets that are readily
available for use in new applications, and is continuously updated with
additional government datasets.
• More: http://www.microsoft.com/industry/government/opengovdata/
- 13. > Civic Commons
http://civiccommons.com/
- 17. > unData
http://data.un.org/
- 18. > Ushahidi
http://www.ushahidi.com/
- 19. > Statistisches Bundesamt Deutschland
https://www-genesis.destatis.de/genesis/online/
- 21. > Data360
http://www.data360.org
- 22. > IBM ManyEyes
http://manyeyes.alphaworks.ibm.com/manyeyes/
- 23. > Open Citizen's Platform
Public issue tracking provides increased engagement, transparency, and
participation in the community
Manage issues in urban environments, such as potholes, broken street lighting or
lack of accessibility
What are the benefits to…
Governments
• Reduce time, effort and resources in fulfilling public information requests
• Increase data quality by providing data to public from the source
• Reduce duplication of effort
• Increase data access, availability, and speed of delivery
• Improve citizen satisfaction and create good public relations with your community
Citizens
• Open access to complete, formatted data rather than relying on third party interpretations or subsets
• Information accessibility leads to greater government accountability
• Fosters better community action on social issues, e.g. crime, pollution, permits, accidents, and education
• Improves regional competitiveness by giving businesses quicker and fuller access to data
- 24. > What are the goals of the project?
Long Term…
Build an open citizen platform for Dresden (www.opendresden.de)
Process it, compare it, mix it, filter it, visualize it…
Basic premises
Build a simple system and let it evolve
Design for participation
Openness
For now…
Start with a series of value-added municipal services (e.g.
Mapnificent, Schooloscope, Cycling Planner, see following slides)
Transport, Education, Economy, (Local) Politics, Environment, Entertainment
Promote the open data principle in Saxony
Develop a fluid data repository (for municipal data)
Design a domain specific language in order to integrate and analyze data
Different levels of abstraction
Reuse existing apps, visual dataflow languages, textual DSL editors
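To give a feel for what a small language for "filter it, mix it, visualize it" could look like, here is a minimal sketch of an embedded dataflow DSL built from composable Python functions. It is only an illustration: the step names (`filter_by`, `count_by`, `as_bars`) and the sample rows are hypothetical and do not describe the language planned in the project.

```python
from functools import reduce

def pipeline(*steps):
    """Compose steps left to right into one callable dataflow."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

def filter_by(field, value):
    # Keep only the rows whose field equals the given value.
    return lambda rows: [r for r in rows if r[field] == value]

def count_by(field):
    # Aggregate rows into a {field value: number of rows} mapping.
    def step(rows):
        counts = {}
        for r in rows:
            counts[r[field]] = counts.get(r[field], 0) + 1
        return counts
    return step

def as_bars(counts):
    # Crude text "visualization": one bar line per key, sorted by key.
    return [f"{key}: {'#' * n}" for key, n in sorted(counts.items())]

# Hypothetical municipal issue reports.
rows = [
    {"type": "pothole", "district": "Neustadt"},
    {"type": "pothole", "district": "Neustadt"},
    {"type": "pothole", "district": "Altstadt"},
    {"type": "street light", "district": "Altstadt"},
]

report = pipeline(filter_by("type", "pothole"), count_by("district"), as_bars)
lines = report(rows)
```

Each step is an ordinary function, so the same pipeline could in principle sit behind a visual dataflow editor or a textual DSL at a higher level of abstraction.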
- 27. > Where can I live
http://www.where-can-i-live.com/londonproperty
- 28. > UBC/Google cycling planner
http://www.cyclevancouver.ubc.ca/cv.aspx
- 29. > CitySourced
http://www.citysourced.com
- 30. > EveryBlock
http://chicago.everyblock.com/
- 31. > Architecture – Sketch
Lightweight Integration Techniques
• Join across dimensions (e.g. Entity + Time + Place)
• Aggregations
Lightweight Composite Applications
• Create information from the data
• Uncover hidden aspects of data
• Which becomes new data itself
• Classification, prediction, clustering
• Embrace recursion
API for location-based collaborative issue-tracking
http://open311.org
http://www.omgstandard.com
Diagram components:
Sources: Open Data and Public Data Sources
Interfaces and formats: REST, JSON, KML, GeoRSS
Visualization: Google Maps, OpenStreetMap, IBM ManyEyes
Fluid Data Repository: Citizen Requests, Geo Data, Municipal Data
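The two lightweight integration techniques named on this slide, a join across shared dimensions and an aggregation, can be sketched in a few lines of plain Python. The records, field names and resident numbers below are hypothetical sample values, not real municipal data.

```python
from collections import defaultdict

# Hypothetical citizen requests and municipal data sharing place and time.
issues = [
    {"place": "Neustadt", "month": "2010-05", "type": "pothole"},
    {"place": "Neustadt", "month": "2010-05", "type": "street light"},
    {"place": "Altstadt", "month": "2010-05", "type": "pothole"},
]
population = [
    {"place": "Neustadt", "month": "2010-05", "residents": 20000},
    {"place": "Altstadt", "month": "2010-05", "residents": 5000},
]

def issues_per_1000(issues, population):
    # Aggregation: count issues per (place, month).
    counts = defaultdict(int)
    for issue in issues:
        counts[(issue["place"], issue["month"])] += 1
    # Join: match the counts with the population on the shared dimensions.
    joined = {}
    for row in population:
        key = (row["place"], row["month"])
        joined[key] = counts.get(key, 0) * 1000 / row["residents"]
    return joined

rates = issues_per_1000(issues, population)
```

The result is itself a new dataset keyed by place and time, which is the sense in which derived information "becomes new data itself".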
- 32. > Fluid Data Repository
Platform for the web of things, each represented by an openly writable
"social" object
Share, annotate, augment and re-use information
Mainly concerns data mediation and integration
Need to access and integrate data residing in multiple and heterogeneous
sources
Adaptive: add metrics, aggregations, data sources or data connections
without re-building analysis processes or visualizations
("non-destructive change")
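One way to picture an openly writable "social" object with non-destructive change is an object that anyone may annotate within their own namespace. The sketch below is a hypothetical illustration of that idea only; the class, method and tag names are invented here and do not reflect the API of FluidDB or of the planned repository.

```python
class SocialObject:
    """An object that any user may annotate (hypothetical sketch)."""

    def __init__(self, about):
        self.about = about
        self.tags = {}  # (user, tag name) -> value

    def tag(self, user, name, value):
        # Each user writes into their own namespace, so adding
        # information never overwrites what other users have stored.
        self.tags[(user, name)] = value

    def values(self, name):
        # Collect every user's value for one tag name.
        return {user: v for (user, n), v in self.tags.items() if n == name}

stop = SocialObject("tram stop Albertplatz")
stop.tag("city", "wheelchair_accessible", True)
stop.tag("alice", "rating", 4)
stop.tag("bob", "rating", 2)  # added later; existing annotations are untouched
ratings = stop.values("rating")
```

New tags can be attached at any time, while code that only reads the older tags keeps working unchanged.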
- 33. > Alternative Data Models
NoSQL data stores grouped by data model (diagram):
Column Families: BigTable, HBase, SimpleDB, Cassandra, Hypertable
Documents: MongoDB, CouchDB, RavenDB, OrientDB, ThruDB, Terrastore
Key/Value: Dynamo, Voldemort, Dynomite, Tokyo Cabinet, GT.M, Redis, Scalaris, Pahoehoe, Riak
Triple Stores: RedStore, Virtuoso, Jena, Sesame, YARS
Graph: Neo4J, AllegroGraph, HyperGraphDB, FlockDB, Sones
Other: FluidDB
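To make the difference between some of these data models concrete, the following sketch expresses one hypothetical issue report as a key/value pair, as a document and as triples, using only plain Python structures. The identifiers and field names are invented for illustration, and no specific product's API is implied.

```python
import json

# Key/value: the store only sees a key and an opaque value.
key_value = {"issue:42": json.dumps({"type": "pothole", "district": "Neustadt"})}

# Document: a nested structure whose fields can be addressed directly.
document = {"_id": "issue:42", "type": "pothole",
            "location": {"district": "Neustadt"}}

# Triples: one (subject, predicate, object) statement per fact.
triples = [
    ("issue:42", "type", "pothole"),
    ("issue:42", "district", "Neustadt"),
]

def district_kv(store, key):
    # The application has to decode the opaque value itself.
    return json.loads(store[key])["district"]

def district_doc(doc):
    # Follow the nested path inside the document.
    return doc["location"]["district"]

def district_triples(statements, subject):
    # Pattern match on subject and predicate.
    return next(o for s, p, o in statements
                if s == subject and p == "district")

answers = (
    district_kv(key_value, "issue:42"),
    district_doc(document),
    district_triples(triples, "issue:42"),
)
```

All three lookups return the same district; what differs is how much structure the store itself knows about and therefore can query.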