Big&open data challenges for smartcity-PIC2014 Shanghai


This talk is about how both private enterprise and government seek to increase the value of their data, and how they approach that problem. It summarizes the ways we think about Big Data and Open Data and their use by organizations and individuals. Big Data is explained in terms of collection, storage, analysis and valuation. The data is collected from numerous sources, including sensor networks, government data holdings, company market databases, and public profiles on social networking sites. Organizations use many data analysis techniques to study both structured and unstructured data. Because of the volume, velocity and variety of the data, some specific techniques have been developed; MapReduce, Hadoop and related tools such as RHadoop are currently popular topics.
Several applications and case studies are presented as examples. Data that comes from government sources must be open, and every day more cities and countries are opening their data. Open Data is then presented as a specific case of public data with a special role in the Smartcity. The main goal of Big and Open Data in the Smartcity is to develop systems that are useful for citizens. In this sense, RMap (Mapa de Recursos) is shown as an Open Data application: an open system for Madrid City Council, available for smartphones and developed entirely by the G-TeC research group (

Published in: Technology, Education
  • GRASIA: Intelligent agents and software engineering
  • Big&open data challenges for smartcity-PIC2014 Shanghai

    1. 1. Big and Open Data Challenges for Smartcity. Victoria López, G-TeC Group, Universidad Complutense de Madrid
    2. 2. Big and Open Data. Challenges for Smartcity • Introduction • Fighting with Big Data: Genome Data • Big Data. Big Projects • Open Data. Technology Transfer Opportunities • Smartcity. Big and Open Systems • Madrid as Smartcity • Conclusions
    3. 3. Introduction • Our goal: to transfer technology and knowledge – Mobile technologies applied to the environment – Intelligent agents – Optimization and forecasting from data – Bioinformatics, biostatistics • G-TeC group: statisticians, physicists, mathematicians, economists and several computer scientists.
    4. 4. Fighting with Big Data • Every day we need to deal with more and more data. • For many years, new computers with more memory and higher speed seemed to be the solution to data growth (Elephant vendors). • Many research areas have long been fighting with Big Data: bioinformatics, genome data, DNA, RNA, proteins and, in general, all biological data have required computer monitoring and storage in large databases in laboratories and research centers around the world. The future of genomics rests on the foundation of the Human Genome Project.
    5. 5. Fighting with Big Data • Each time an organization or an individual is not able to deal with its data, it faces a Big Data problem. • The Human Genome Project was managed with the same philosophy as modern Big Data: large databases distributed around the world, with parallel processing when available and suitable. • Our experience: sequence alignment and its optimization with dynamic programming and heuristics. • The body of biological data is itself a big database. • Adding new sequences, searching and forecasting are tasks very similar to those we face in every Big Data problem.
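The slide mentions sequence alignment optimized with dynamic programming but does not show the method. As a minimal sketch only, the following uses the textbook Needleman-Wunsch global alignment, which may differ from what the group actually implemented. The scoring values (+5 match, -4 mismatch, linear gap penalty of -4) and the function name are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-4):
    """Global alignment by dynamic programming (linear gap penalty).

    Returns (score, aligned_a, aligned_b). Scoring values are assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two symbols
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table has (n+1)(m+1) cells, so time and memory grow with the product of the sequence lengths. That cost is why heuristics such as FASTA and BLAST are preferred when searching whole databases.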
    6. 6. Use Case: Looking for a Fungus (22/05/2014; photo: vineyards in La Geria, Lanzarote) • Application to infections in agricultural crops when it is not possible to identify the actual fungus. • The person responsible needs to decide what to do, which treatment to apply, or which procedure is better. – A fragment of fungus DNA must be sequenced in the lab. – Then the scientist looks for it in molecular databases by means of sequence searching (“DB homology search”). – Some alignment algorithms (BLAST, FASTA) are executed to return the best matches. • The sequence: gtttacgctctacaaccctttgtgaacatacctacaactgttgcttcggcgggtagggtctccgcgaccctcccggcctcccgcctccgggcgggtcggcgcccgccggaggataaccaaactctgatttaacgacgtttcttctgagtggtacaagcaaataatcaaaacttttaacaaccggatctcttggttctggcatcgatgaagaacgcagcgaaatgcgataagtaatgtgaat
    7. 7. Use Case: Database and Algorithm Selection (PIC 2014, Shanghai) 1. EBI: European Bioinformatics Institute 2. Choose among the tools available on the web site: a. Fasta3 b. Select database: Nucleic Acids, Fungi c. Enter the sequence and run the query 3. A sorted (but not complete) list of matches, from best to worst similarity, is returned.
    8. 8. Use Case: EBI Web Site
    9. 9. Use Case: Web Toolbox in EBI
    10. 10. Use Case: Algorithm Fasta 3
    11. 11. Use Case: Databases, Nucleic Acids: Fungi
    12. 12. Use Case: Enter the sequence and run FASTA 3
    13. 13. Use Case: The Output
        FASTA searches a protein or DNA sequence data bank
        version 3.3t09 May 18, 2001
        Please cite: W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
        @:1-: 241 nt
        vs EMBL Fungi library
        searching /ebi/services/idata/v225/fastadb/em_fun library
        104701680 residues in 66478 sequences
        statistics extrapolated from 60000 to 61164 sequences
        Expectation_n fit: rho(ln(x))= -1.2290+/-0.000361; mu= 72.1313+/- 0.026
        mean_var=907.6270+/-295.007, 0's: 68 Z-trim: 4246 B-trim: 15652 in 3/79
        Lambda= 0.0426
        FASTA (3.39 May 2001) function [optimized, +5/-4 matrix (5:-4)] ktup: 6
        join: 48, opt: 33, gap-pen: -16/ -4, width: 16
        Scan time: 3.180
        The best scores are:                                        opt  bits  E(61164)
        EM_FUN:CGL301988 AJ301988.1 Colletotrichum glo (1484) [f]  1184    88  5.7e-17
        EM_FUN:AF090855  AF090855.1 Colletotrichum gloe ( 500) [f] 1205    88  7.3e-17
        EM_FUN:CGL301986 AJ301986.1 Colletotrichum glo (1484) [f]  1166    87  1.2e-16
        EM_FUN:CGL301908 AJ301908.1 Colletotrichum glo (2868) [f]  1148    87  1.3e-16
        EM_FUN:CGL301909 AJ301909.1 Colletotrichum glo (2868) [f]  1148    87  1.3e-16
        EM_FUN:CGL301907 AJ301907.1 Colletotrichum glo (2867) [f]  1148    87  1.3e-16
        EM_FUN:CGL301919 AJ301919.1 Colletotrichum glo (1171) [f]  1166    87  1.6e-16
        EM_FUN:CGL301977 AJ301977.1 Colletotrichum glo (1876) [f]  1148    86  2e-16
        EM_FUN:CFR301912 AJ301912.1 Colletotrichum fra (2870) [f]  1137    86  2.1e-16
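The `ktup: 6` value in the output refers to the word length FASTA uses to find candidate matches quickly before it runs the more expensive alignment. The sketch below illustrates only that first idea, ranking library sequences by the number of 6-letter words they share with the query. It is a deliberate simplification: the real program also scores diagonals, joins regions and computes statistics. All function names are hypothetical.

```python
from collections import Counter


def ktuples(seq, k=6):
    """Count every overlapping word of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_ktuple_count(query, target, k=6):
    """Number of k-letter words the two sequences have in common."""
    q, t = ktuples(query, k), ktuples(target, k)
    return sum(min(count, t[word]) for word, count in q.items())


def rank_library(query, library, k=6):
    """Return (shared_words, name) pairs, best candidates first.

    `library` maps a sequence name to its sequence string.
    """
    scored = [(shared_ktuple_count(query, seq, k), name)
              for name, seq in library.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Only the top-ranked candidates would then be passed to a full alignment, which is how a heuristic search can scan a library of tens of thousands of sequences in a few seconds.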
    14. 14. Our Background in Bioinformatics • Bioinformatics (Master's in Research in Informatics, UCM) • Several master's theses & publications – Alignment of sequences with R and RHadoop* – Analysis & visualization with the R language and Chernoff faces – Others
    15. 15. Big Data: From Data Warehouse to Big Data (large databases) • 1970: relational model invented • RDBMS declared mainstream until the 1990s • One size fits all; Elephant vendors; heavily encoded, even indexing by B-trees.
    16. 16. Alex 'Sandy' Pentland, director of 'Media Lab' at Massachusetts Institute of Technology (MIT): The Big Data Revolution, Campus Party Europe 2013 • Nowadays business needs high availability of data, so new techniques must be developed: complex analytics, graph databases • Data volume is increasing exponentially – 44x increase from 2009 to 2020 – From 0.8 zettabytes to 35 ZB
    17. 17. Who generates Big Data? • Unstructured data • Progress and innovation are no longer hampered by the ability to collect data, but by the ability to manage, analyze, synthesize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable way
    18. 18. Big Data: the 3+1+1 V’s
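As a quick check of the growth figures on this slide, the ratio of the two volumes and the annual growth rate they imply can be worked out as follows. The constant year-on-year compounding over the 11 years from 2009 to 2020 is an assumption made here, not something the slide states.

```latex
\[
\frac{35\ \text{ZB}}{0.8\ \text{ZB}} = 43.75 \approx 44,
\qquad
g = 44^{1/11} - 1 \approx 0.41
\]
```

Under that assumption, a 44x increase corresponds to data volume growing by roughly 41% per year.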
    19. 19. From Data to Value: the list of stages proposed by MIT to deal with Big Data • Big Data Collection – Monitoring – Data cleaning and integration – Hosted data platforms and the cloud • Big Data Storage – Modern databases – Distributed computing platforms – NoSQL, NewSQL • Big Data Systems – Security – Multicore scalability – Visualization and user interfaces • Big Data Analytics – Fast algorithms – Data compression – Machine learning tools – Visualization & reporting
    20. 20. Big Data in Use 1. High availability is now a requirement 2. Hosted (not only in-house) and cloud computing 3. Running in parallel: (a) data aggregation process, (b) analytics on data, (c) GraphDBMS similarities 4. Not only SQL: Cassandra* and MongoDB** *The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. **Document-oriented storage
    21. 21. Big Data: MapReduce • MapReduce – Parallel programming model – Simple, smart concept, suitable for multiple applications – Big datasets → multi-node, multiprocessor – Sets of nodes: clusters or grids (distributed programming) • Main feature: scalability to many nodes – Scan of 100 TB on 1 node @ 50 MB/s = 23 days – Scan on a cluster of 1000 nodes = 33 minutes • By Google (2004) – Able to process 20 PB per day – Based on map & reduce, classical methods in functional programming related to the classic divide & conquer – Comes from numerical analysis (big matrix products).
    22. 22. Big Data: MapReduce • Friendly for non-technical users
    23. 23. Big Data: Hadoop – Used by Yahoo!, Facebook, Twitter, Amazon, eBay… – Can be used in different architectures: both clusters (in-house) and grid (cloud computing) – Storm and Spark follow the same model “in memory” instead of on disk
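To make the numbers and the programming model on this slide concrete, here is a minimal single-process sketch. It first reproduces the scan-time arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even splitting across nodes. It then imitates the map, shuffle and reduce steps with a word count. The function names are illustrative, and nothing here is distributed: a real framework such as Hadoop runs these phases on many machines.

```python
from collections import defaultdict


def scan_seconds(data_bytes, node_bytes_per_sec, nodes=1):
    """Time to scan a dataset split evenly across `nodes` (idealized)."""
    return data_bytes / (node_bytes_per_sec * nodes)


def map_phase(chunk):
    """Map: emit a (key, value) pair for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def map_reduce(chunks):
    """Run the three phases in sequence over a list of text chunks."""
    mapped = (pair for chunk in chunks for pair in map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    TB, MB = 10**12, 10**6
    print(scan_seconds(100 * TB, 50 * MB) / 86400, "days on 1 node")
    print(scan_seconds(100 * TB, 50 * MB, nodes=1000) / 60, "minutes on 1000 nodes")
    print(map_reduce(["Big Data open data", "open data"]))
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread over many nodes, which is where the drop from days to minutes comes from.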
    24. 24. More technical information •
    25. 25. Technology Transfer Opportunities • A great opportunity for researchers working in technology transfer, who can direct their efforts toward developing new optimization techniques for: – Monitoring data (sensors, smartphones, …) – Storing data (cloud computing, Amazon S3, EC2, Google BigQuery, Tableau, …) – Cleaning, integrating & processing data (Data Curation at Scale: The Data Tamer System, M. Stonebraker et al., CIDR 2013) – Analysing data (R, SAS… but also Google, Amazon, eBay...) – Encryption & searching on encrypted data – Data mining techniques (machine learning, data clustering, predictive models, ...) that are compatible with Big Data through complex analytics
    26. 26. Big Data. Big Projects. • Google • eBay • Amazon • Twitter • … • They develop big projects with their own big data, but many businesses also obtain data from them to run their own analyses. • Government data. Public data.
    27. 27. Working with Big Data in the G-TeC group
    28. 28. 28
    29. 29. Academia & Industry Working Together • OMUS • Industry know-how and expertise • Data collection • Big Data and analytics • Patents, intellectual property and other output • Doctoral thesis: joint guidance • University theoretical models & research
    30. 30. Open Data • “Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and share alike.” • Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form. • Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution, including the intermixing with other datasets. The data must be machine-readable. • Universal Participation: everyone must be able to use, reuse and redistribute – there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.
    31. 31. Open Data
    32. 32. Why Open Data, by the Open Knowledge Foundation
    33. 33. Open Data for Smartcity • What can a citizen expect when living in a city? • Internet of Things – Libraries – Public transportation, traffic monitoring – Pets, devices, cars, even people • Intelligent agents – Interacting without our control – Credit card control (BBVA use case)
    34. 34. CKAN • The Comprehensive Knowledge Archive Network (CKAN) is a web-based open source data management system for the storage and distribution of data, such as spreadsheets and the contents of databases. It is inspired by the package management capabilities common to open source operating systems like Linux. • Its code base is maintained by the Open Knowledge Foundation. • The system is used both as a public platform on Datahub and in various government data catalogues (the UK's, the Dutch National Data Register, the United States government's and the Australian government's "Gov 2.0")
    35. 35. Basic Structure: Client/Server Pattern • Diagram labels: public data, web service, server, client, web server
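For readers who want to see how a client might query such a catalogue, the sketch below builds a request for CKAN's `package_search` action and extracts dataset names and resource URLs from the JSON reply. The base URL `https://example.org` is a placeholder, not a real portal, and the helper names are hypothetical. Field names follow the CKAN Action API as generally documented, so check them against the portal you actually use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_search_url(base_url, query, rows=10):
    """Build the URL for CKAN's package_search action (Action API v3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarize(response):
    """Turn a package_search reply into (dataset name, resource URLs) pairs."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [
        (pkg.get("name"), [res.get("url") for res in pkg.get("resources", [])])
        for pkg in response["result"]["results"]
    ]


def search(base_url, query, rows=10):
    """Perform the HTTP request; needs network access and a real portal."""
    with urlopen(package_search_url(base_url, query, rows)) as reply:
        return summarize(json.load(reply))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline, for example against a saved reply.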
    36. 36. Smartcity Concept • Large numbers of people. Big cities. – Search 7 thousand differences • Smartcity business. • The role of technology in the city: efficiency & security • Standardization of the concept of Smartcity (May 2014) – Better quality of life. Security – Sustainability – Innovation opportunities – Multidisciplinary: social researchers, engineers, architects, … • Relationships are changing, driven by mobile technologies (smartphones, tablets, Internet of Things, …) • Cross-cutting development projects: sensors and monitoring devices, connectivity, platform, services in the cloud.
    37. 37. Smartcity Concept • Large amount of unstructured information • Machine learning, Big Data technologies, Internet of Things and intelligent systems are needed. • Technology development as a service in all areas: 1. Structure: – Environment, infrastructure (water, energy, material, mobility, nature), built domain 2. Society: – Public space, functions, people 3. Data: – Information flows, performance
    38. 38. Mapa de Recursos (Resource Map), RECYCLA.TE • Mariam Saucedo, Pilar Torralbo, Daniel Sanz, Ana Alfaro, Sergio Ballesteros, Lidia Sesma, Héctor Martos, Álvaro Bustillo, Arturo Callejo, Belén Abellanas, Jaime Ramos, Ignacio P. de Ziriza, Victor Torres, Alberto Segovia, Miguel Bueno, Mar Octavio de Toledo, Antonio Sanmartín, Carlos Fernández
    39. 39. Madrid – Smart City • Parks and gardens • Parking for – Cars – Motorbikes – Bikes • Recycling points – Fixed – Mobile – Clothes • Stations – Bioethanol – Gas – Oil – Electric • Routes for bikes – Cycle lanes – Safe streets • Residential Priority Areas
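One thing a resource map on a smartphone might do with these categories is find the closest resource of a given kind to the user. The sketch below is purely illustrative and is not taken from the RMap application: the record layout (`kind`, `lat`, `lon`) and the function names are assumptions. Distances use the standard haversine formula with a mean Earth radius of 6371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(resources, lat, lon, kind=None):
    """Return the closest resource record, optionally filtered by kind.

    Each record is a dict with hypothetical keys 'kind', 'lat' and 'lon'.
    Returns None when no record matches.
    """
    candidates = [r for r in resources if kind is None or r["kind"] == kind]
    if not candidates:
        return None
    return min(candidates,
               key=lambda r: haversine_km(lat, lon, r["lat"], r["lon"]))
```

A linear scan like this is fine for a few thousand points; a larger dataset would call for a spatial index.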
    40. 40. 40
    42. 42. 42
    43. 43. Data Analytics, Data Scientist: From (Unstructured) Data to Value
    44. 44. PIC 2014 MyConference
    45. 45. Be ready for PIC 2014 with MyConference • Main menu • Access to committees • Venue and location • Extra information
    46. 46.
    47. 47. Conclusions: Big Data, Open Data and Smartcity • A great opportunity for researchers working in technology transfer, who can direct their efforts toward developing new optimization techniques for: – Monitoring data – Storing data – Cleaning, integrating & processing data – Analysing data – Encryption & searching on encrypted data – Data mining techniques • Much future work remains in developing new smart cities in the areas of environment, security and infrastructure.
    48. 48. Big and Open Data Challenges for Smartcity. Victoria López, G-TeC Group, Universidad Complutense de Madrid