This document discusses using MapReduce and HDFS to efficiently process large remote sensing images in parallel. It provides context on the large volume of data produced by remote sensing (e.g. 1.2 GB for a 1 km resolution image of 0.5 billion pixels) and the challenges of storage, transport and processing. It reviews literature on related projects processing large datasets and the key concepts of HDFS for robust distributed storage and MapReduce for parallel processing. Finally, it outlines a planned approach that starts with simple algorithms and expands to more complex spatial and temporal processing.
This document discusses HIPI, a computer vision framework for processing large image datasets in a distributed manner. It introduces HIPI's MapReduce-based workflow and how it implements image processing tasks at scale. Performance tests show that HIPI can perform principal component analysis on image datasets orders of magnitude larger than previous works. Optimizations like culling help improve performance by only decompressing necessary images. Overall, HIPI offers performance on par with or better than alternatives and significant improvements for processing large collections of images.
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin... (Cloudera, Inc.)
The document discusses Skybox Imaging's approach to indexing the entire planet using distributed low-cost satellites and Hadoop. It notes that today's satellite imagery data is often years old. Skybox proposes to use a network of many small satellites that can generate over 1 terabyte of raw data per day. This massive amount of data would be stored and processed using Hadoop on the ground. Skybox has developed an approach called BusBoy to integrate Hadoop and native code for efficient scientific processing of the satellite imagery data.
The growth of the amount of medical image data produced on a daily basis in modern hospitals forces the adaptation of traditional medical image analysis and indexing approaches towards scalable solutions. In this work, MapReduce is used to speed up and make possible three large-scale medical image processing use-cases: (i) parameter optimization for lung texture classification using support vector machines (SVM), (ii) content-based medical image indexing, and (iii) three-dimensional directional wavelet analysis for solid texture classification.
The document discusses virtualizing Hadoop and provides details on:
1) Current and projected usage of Hadoop shows a trend toward more virtualized deployments on-premises and in the public cloud.
2) Different virtualization platform scenarios for Hadoop, including shared storage, storage appliances, or local disks. Common virtualization platforms that can be used are VMware vSphere and OpenStack.
3) Example virtualization architectures showing how Hadoop could be deployed with shared NAS storage or using direct-attached storage on each virtual machine.
Hadoop on OpenStack - Sahara @DevNation 2014 (spinningmatt)
This document provides an overview of Sahara, an OpenStack project that aims to simplify managing Hadoop infrastructure and tools. Sahara allows users to create and manage Hadoop clusters through a programmatic API or web console. It uses a plugin architecture where Hadoop distribution vendors can integrate their management software. Currently there are plugins for vanilla Apache Hadoop, Hortonworks Data Platform, and Intel Distribution for Apache Hadoop. The document outlines Sahara's architecture, APIs, roadmap, and demonstrates its use through a live demo analyzing transaction data with the BigPetStore sample application on Hadoop.
Ic Accounting Presentation 15 Min Presentation (NigelDawes)
This document discusses intellectual capital accounting and methods for measuring intangible assets. It outlines Karl-Erik Sveiby's four approaches to measuring intangibles: direct, market capitalization, return on assets, and scorecard methods. It also discusses challenges in measuring social phenomena and the need for new management tools to capture the increased importance of intangible assets like customer relationships and innovation capabilities.
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage (Cloudera, Inc.)
Learn about:
Why big data matters to your business: realize revenue, increase customer loyalty, and pinpoint effective strategies
The business and technical challenges of big data solutions
How to leverage big data for competitive advantage
The “must haves” of an effective big data solution
Real-world examples of Cloudera, Pentaho and Dell big data solutions in action
How to give a Creative Presentation in 10 minutes by Two pens (Cynthia Hartwig)
This is a primer for people who are put under the gun to present creative work in on-the-fly situations. Ten minutes can be well spent if you use Creative Director Cynthia Hartwig's tricks; otherwise, work you've spent weeks designing, writing and creating will be deep-sixed faster than you can remove yourself from the room. Includes tips on recapping the creative brief, appointing a Meeting Czar (someone who runs the meeting and drives decisions) and presenting so that everybody can see that little, teeny mouse-type (assuming it's important). This deck is a supplement to How to Rock a Presentation by Cynthia Hartwig at Two Pens.
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics (DataWorks Summit)
1. The Hadoop Image Processing (HIP) pipeline acquires vehicle images, identifies updates, generates URLs, crops and resizes images, copies them to asset servers, and removes duplicates.
2. It uses HBase for image storage and archiving, MapReduce for image processing, Kafka for publishing to asset servers, OpenCV for image processing, and Avro for data serialization.
3. Performance testing showed HIP scales linearly and is at least 10x faster than the previous system, and using cascading downloads provided a 20% performance gain.
Big Data - The 5 Vs Everyone Must Know (Bernard Marr)
This slide deck, by Big Data guru Bernard Marr, outlines the 5 Vs of big data. It describes in simple language what big data is, in terms of Volume, Velocity, Variety, Veracity and Value.
Scrum is an agile framework for managing product development that focuses on continuous delivery of working software in short cycles called sprints, typically two weeks or less. Scrum emphasizes self-organizing cross-functional teams and accountability, iterative development and progress transparency through regular inspection of working increments. Key Scrum practices include sprint planning, daily stand-up meetings, sprint reviews, and retrospectives. Scrum can scale to large, complex projects through techniques like Scrum of Scrums.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
The document discusses big data in astronomy and the LineA-DEXL case. It provides an outline and introduction to big data in science and hypothesis-driven research. It discusses data management techniques like data partitioning and parallel workflow processing. It then provides details on the Laboratorio Nacional de Computacao Cientifica (LNCC) and its role in supporting computational modeling and bioinformatics. It discusses astronomy surveys that generate large amounts of data like the Dark Energy Survey and challenges of data from the Large Synoptic Survey Telescope. Finally, it discusses the need for data infrastructure, metadata management, and distributed data management to support scientific research involving big data.
Building A Scalable Open Source Storage SolutionPhil Cryer
The Biodiversity Heritage Library (BHL), like many other projects within biodiversity informatics, maintains terabytes of data that must be safeguarded against loss. Further, a scalable and resilient infrastructure is required to enable continuous data interoperability, as BHL provides unique services to its community of users. This volume of data and associated availability requirements present significant challenges to a distributed organization like BHL, not only in funding capital equipment purchases, but also in ongoing system administration and maintenance. A new standardized system is required to bring new opportunities to collaborate on distributed services and processing across what will be geographically dispersed nodes. Such services and processing include taxon name finding, indexes or GUID/LSID services, distributed text mining, names reconciliation and other computationally intensive tasks, or tasks with high availability requirements.
The document lists the top 10 most notable open source projects on GitHub in 2012, including Ruby on Rails, CyanogenMod, CocoaPods, Symfony, Zend Framework, OpenStack Compute, Puppet, TrinityCore, and Hubot scripts. Amal Roumi's document is distributed under a Creative Commons license.
This document discusses large-scale data processing using Apache Hadoop at SARA and BiG Grid. It provides an introduction to Hadoop and MapReduce, noting that data is easier to collect, store, and analyze in large quantities. Examples are given of projects using Hadoop at SARA, including analyzing Wikipedia data and structural health monitoring. The talk outlines the Hadoop ecosystem and timeline of its adoption at SARA. It discusses how scientists are using Hadoop for tasks like information retrieval, machine learning, and bioinformatics.
The Elephant in the Room: Big Data Analytics in the Cloud (Khazret Sapenov)
The document discusses big data analytics in the cloud, including definitions of big data and analytics. It covers technologies like Hadoop, Dremel, and Storm, and how they can be used for business intelligence, operational intelligence, and value creation. It also discusses architecture considerations for big data analytic systems in the cloud, including data transfer speeds. The presentation aims to provide an overview of approaches for near real-time business intelligence and analytics using these technologies, both their applicability and limitations when used in the cloud.
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg... (i_scienceEU)
Network of Excellence Internet Science Summer School. The theme of the summer school is "Internet Privacy and Identity, Trust and Reputation Mechanisms".
More information: http://www.internet-science.eu/
The computational requirements of next-generation sequencing are placing a huge demand on IT organisations.
Building compute clusters is now a well understood and relatively straightforward problem. However, NGS applications require large amounts of storage and high IO rates.
This talk details our approach for providing storage for next-gen sequencing applications.
Talk given at BIO-IT World, Europe, 2009.
This document discusses the intersection of machine learning and search-based software engineering (ML & SBSE). It provides examples of how data miners can find signals in software engineering artifacts using machine learning techniques. It then discusses how better algorithms do not yet necessarily lead to better mining, and emphasizes the importance of sharing data, models, and analysis methods. Finally, it outlines a vision for "discussion mining" to guide teams in walking across the space of local models, with the goal of building a science of localism in ML and SBSE.
Viet-Trung Tran presents information on big data and cloud computing. The document discusses key concepts like what constitutes big data, popular big data management systems like Hadoop and NoSQL databases, and how cloud computing can enable big data processing by providing scalable infrastructure. Some benefits of running big data analytics on the cloud include cost reduction, rapid provisioning, and flexibility/scalability. However, big data may not always be suitable for the cloud due to issues like data security, latency requirements, and multi-tenancy overhead.
Design, Scale and Performance of MapR's Distribution for Hadoop (mcsrivas)
Details the first ever exabyte-scale system, which can hold a trillion large files. Describes MapR's Distributed NameNode (tm) architecture and how it scales easily and seamlessly. Shows MapReduce performance across a variety of benchmarks such as dfsio, pig-mix, nnbench, terasort and YCSB.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C... (Reynold Xin)
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
From Sensor Data to Triples: Information Flow in Semantic Sensor Networks (Nikolaos Konstantinou)
Sensor networks constitute a technological approach of increasing popularity when it comes to monitoring an area, offering context-aware solutions. This shift from desktop computing to ubiquitous computing entails numerous options and challenges in designing, implementing and shaping the behavior of systems that consume, integrate, fuse and exploit sensor data. Things tend to be more complicated when, in order to extract meaning from the collected information, the Semantic Web paradigm is adopted. In this talk, we discuss the information flow in systems that collect sensor data into semantically annotated repositories. Specifically, we analyze the journey that information makes, from its capture as electromagnetic pulses by the sensors to its storage as Semantic Web triples, along with its semantics, in the system's knowledge base. We introduce the main related concepts, analyze the main components that such systems comprise and the choices that can be made, and discuss the respective benefits, drawbacks, and effects on the overall system properties.
This document discusses publishing and consuming linked sensor data. It provides motivation for representing sensor data as linked data by discussing challenges in accessing heterogeneous sensor data from different sources. It then outlines some of the key ingredients needed for linked sensor data, including ontologies to model sensor metadata and observations, guidelines for generating identifiers, and query processing engines for accessing the data. Examples of existing linked sensor data sources are also provided.
The document discusses DataONE, a project aimed at improving data repository interoperability and advancing best practices in data lifecycle management. It focuses on enabling access to multiple external data repositories from within a HUB environment. This would allow users to aggregate and integrate disparate datasets for new analyses, and enable reproducible workflows. The goal is to address issues around scattered and dispersed data by improving discovery, integration and long-term preservation of datasets.
Cloud Programming Models: eScience, Big Data, etc. (Alexandru Iosup)
This document discusses cloud programming models. It begins by defining programming models and noting that they provide an abstraction of a computer system through a language, libraries and runtime system. It then lists some key characteristics of a cloud programming model including efficiency, scalability, fault tolerance and data models. The document outlines an agenda to cover programming models for compute-intensive and big data workloads. It provides examples of bags of tasks and workflow programming models and their applications in fields like bioinformatics.
The document discusses the EGEE project, which builds and supports scientific communities using grid computing. It provides a worldwide computing infrastructure integrating software, resources, and technical support. EGEE supports many scientific domains with large data needs, such as high energy physics, astronomy, genomics and earth observation. It currently connects over 17,000 users and 136,000 CPUs, with 25 petabytes of disk and 39 petabytes of tape, across 268 sites in 48 countries. The gLite middleware allows applications to access these distributed resources for high throughput data analysis and storage.
The Synergy Between the Object Database, Graph Database, Cloud Computing and ... (InfiniteGraph)
This document summarizes a presentation given by Leon Guzenda on the synergy between object database, graph database, cloud computing and NoSQL paradigms. It provides a historical overview of object database management systems and discusses their inherent advantages over relational databases. It also covers how these technologies have evolved, including the development of "NoSQL" systems, and how an object database management system can leverage other technologies like Hadoop. The presentation concludes that object database management systems are still highly relevant and that graph databases can complement relational, NoSQL and object database technologies.
15. Problem statement
Better images
Better sensors, more information
More expensive storage
More data
Data transport
Expensive supercomputers
More computation
Parallel processing
16. Objectives
• Fast enough
• Affordable
• Scalable
File system + software framework
17. Research questions
• How can large satellite images be stored in an HDFS file system in such a way that they can be processed efficiently in parallel?
• Which algorithms can be used with this storage technique and MapReduce?
19. Literature review
• Interesting projects
• HDFS
• MapReduce
• Implementations
• Distributions
• Current literature
20. Interesting projects
• NASA (12)
• Center for Climate Simulation
• Square Kilometer Array: 700 TB/sec
• Open Cloud Consortium (13)
• Project Matsu: Elastic Clouds for Disaster Relief
• CERN: Large Hadron Collider (14)
• 20 PB/year
21. HDFS
• Distributed file system
• Based on the Google File System (1)
• Large blocks (128 MiB)
• Commodity hardware
• Failure is the norm
• Read & append (1)
[Diagram: a file split into blocks 1, 2, ..., n across the cluster]
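To make the block-based storage concrete, here is a minimal Java sketch of writing a large image to HDFS with the 128 MiB block size mentioned above, using the standard org.apache.hadoop.fs API. The NameNode address, file names and replication factor are illustrative assumptions, not values from the thesis.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsImageUpload {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; a real cluster sets this in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             InputStream local = Files.newInputStream(Paths.get("scene.tif"))) {
            long blockSize = 128L * 1024 * 1024; // 128 MiB blocks, as on the slide
            short replication = 3;               // assumed replication factor
            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out = fs.create(
                    new Path("/images/scene.tif"), true, 4096, replication, blockSize)) {
                local.transferTo(out);           // stream the local image into HDFS
            }
        }
    }
}
```

The file is then stored as a handful of 128 MiB blocks, each of which can later be handed to a map task running on a node that holds a copy of it.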
22. HDFS
A DFS usually accounts for transparent file replication and fault tolerance, and enables data locality for processing tasks. A DFS does this by subdividing files and distributing these blocks within a cluster of computers. Figure 2 shows the distribution of a file (left) subdivided into three blocks.
[Figure 2: File blocks, distribution and replication in a distributed file system]
23. HDFS
[Figure 4: Block assembly for data retrieval from the distributed file system]
24. HDFS
[Figure 3] illustrates how the file system handles node failure by automated recovery. HDFS further uses checksums to verify block integrity. As long as there is an accessible copy of a block, it can automatically re-replicate to return to the full replication rate.
[Figure 3: Automatic repair in case of cluster node failure by additional replication]
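Checksumming and re-replication happen inside HDFS itself, but both are also reachable from the client API; a small hedged sketch (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path image = new Path("/images/scene.tif"); // illustrative path

        // HDFS keeps per-block checksums; the client can request a
        // file-level checksum to verify integrity after a transfer.
        FileChecksum checksum = fs.getFileChecksum(image);
        System.out.println("checksum: " + checksum);

        // Raising the replication factor asks the NameNode to create extra
        // copies, the same mechanism it uses to repair under-replicated blocks.
        short current = fs.getFileStatus(image).getReplication();
        fs.setReplication(image, (short) (current + 1));
    }
}
```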
25. HDFS - Overview
• Scalable
• Fast reads/writes
• Robust
• A factor of 10 cheaper (2)
28. MapReduce - Overview
• Based on Google MapReduce (3)
• Data locality
• Key/value pairs
• Very fast
• A different way of thinking
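To make the key/value model concrete, here is a self-contained MapReduce job in Java that computes a grey-value histogram. It assumes, purely for illustration, that each input line holds whitespace-separated 8-bit pixel values; real satellite imagery would need a custom InputFormat, which is exactly what the research questions above are about.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PixelHistogram {

    // Map: for every pixel value on the line, emit the key/value pair (value, 1).
    public static class PixelMapper
            extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private final IntWritable pixel = new IntWritable();
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    pixel.set(Integer.parseInt(token));
                    ctx.write(pixel, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts per grey value.
    public static class SumReducer
            extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        protected void reduce(IntWritable pixel, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(pixel, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pixel histogram");
        job.setJarByClass(PixelHistogram.class);
        job.setMapperClass(PixelMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on the mapper's node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The combiner runs on the mapper's node, so most of the counting happens next to the data blocks: data locality in practice.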
29. Implementations

              Hadoop   Stratosphere   HPCC
  Support     +        -              +
  Extensions  +        -              ?
  Community   +++      +/-            -
  Target      ANY      EDU            BI

• Apache Software Foundation
• Others: outdated, commercial, little support (4-6)
30. Distributions
• MapR (8)
• Hortonworks (7)
• Cloudera: Cloudera Manager (9)
• Web interface
• 1-click install (yeah right...)
• Interesting licensing model
31. General
• Mostly text processing
• For small images (10)
• Little detail
• Commercial (11)
33. Planning
[Gantt chart: literature review and phases 1-4, plotted against the dates 01/09, 01/02, 15/03 and 20/05, with markers for the internship, today, and handing in the report, all within the master's thesis]
34. Phase 1 - Done
[Cluster diagram: the master node (192.168.10.245) runs the Job Tracker (JT) and Name Node (NN); the workstations Bruno (192.168.10.246), Tim (192.168.10.247), Sven (192.168.10.248) and Patrick (192.168.10.249) each run a Task Tracker (TT) and Data Node (DN). The nodes are RedHat 6.2 workstations and virtual machines.]
35. Phase 2
• Simple algorithm
• Rotate an image (see the sketch below)
• Standard IO
• HDFS
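A hedged sketch of what the phase 2 rotation could look like in Java, reading the image from HDFS and writing the result back; the paths and the PNG format are assumptions, not choices made in the thesis:

```java
import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;

import javax.imageio.ImageIO;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RotateImage {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Read the source image straight from an HDFS stream (illustrative path).
        BufferedImage src;
        try (FSDataInputStream in = fs.open(new Path("/images/scene.png"))) {
            src = ImageIO.read(in);
        }

        // Rotate 90 degrees: translate first so the result lands at the origin.
        AffineTransform tx = new AffineTransform();
        tx.translate(src.getHeight(), 0);
        tx.quadrantRotate(1);
        BufferedImage dst = new BufferedImage(
                src.getHeight(), src.getWidth(), BufferedImage.TYPE_INT_RGB);
        new AffineTransformOp(tx, AffineTransformOp.TYPE_NEAREST_NEIGHBOR)
                .filter(src, dst);

        // Write the rotated image back to HDFS.
        try (FSDataOutputStream out = fs.create(new Path("/images/scene_rot90.png"), true)) {
            ImageIO.write(dst, "png", out);
        }
    }
}
```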
36. Phase 3
• More complexity: MapReduce
• Spatial: convolution mask, ROI (see the sketch below)
• Temporal/spectral: multiple images
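For the spatial step, Java's built-in java.awt.image classes already express both a convolution mask and a region of interest. A hedged sketch with an assumed 3x3 sharpening kernel (the mask and ROI size are illustrative, not from the thesis):

```java
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;

public class SpatialFilter {

    // Apply a 3x3 convolution mask to a region of interest of the image.
    public static BufferedImage filterRoi(BufferedImage img,
                                          int x, int y, int w, int h) {
        // Example mask: a simple sharpening kernel (an assumption for the demo).
        float[] mask = {
                 0f, -1f,  0f,
                -1f,  5f, -1f,
                 0f, -1f,  0f
        };
        ConvolveOp op = new ConvolveOp(new Kernel(3, 3, mask),
                                       ConvolveOp.EDGE_NO_OP, null);

        // getSubimage shares pixel data with img; filter(roi, null) writes
        // into a fresh destination image, leaving the original untouched.
        BufferedImage roi = img.getSubimage(x, y, w, h);
        return op.filter(roi, null);
    }

    public static void main(String[] args) throws Exception {
        BufferedImage img = javax.imageio.ImageIO.read(new java.io.File(args[0]));
        BufferedImage filtered = filterRoi(img, 0, 0,
                Math.min(256, img.getWidth()), Math.min(256, img.getHeight()));
        javax.imageio.ImageIO.write(filtered, "png", new java.io.File(args[1]));
    }
}
```

In the MapReduce setting, the same per-tile filtering would run inside map tasks; the temporal/spectral step would then combine corresponding tiles from multiple images in the reducer.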
37. Phase 4
• Performance as a function of pixel distance
38. Planning
[Gantt chart repeated: literature review and phases 1-4 against the dates 01/09, 01/02, 15/03 and 20/05, with markers for the internship, today, and handing in the report, all within the master's thesis]
39. The End
• Lots of data
• A different way of thinking
• Many possibilities
• RLZ or a new Big Data elective? ;)
• MapReduce + OpenCL?
• Many challenges
• Many questions
40. References
(1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), ‘The Google file system’
(2) Krishnan, S., Baru, C. and Crosby, C. (2010), ‘Evaluation of MapReduce for gridding LIDAR data’
(3) Dean, J. and Ghemawat, S. (2004), ‘MapReduce: simplified data processing on large clusters’
(4) http://hadoop.apache.org/
(5) Warneke, D. and Kao, O. (2009), ‘Nephele: efficient parallel data processing in the cloud’, http://www.stratosphere.eu
(6) http://hpccsystems.com/
(7) http://hortonworks.com/
(8) http://mapr.com/
(9) http://cloudera.com/
(10) Sweeney, C. (2011), ‘HIPI: Hadoop image processing interface for image-based MapReduce’
(11) Guinan, O. (2011), ‘Indexing the earth - large scale satellite image processing using hadoop’, http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
(12) Duffy, D.Q. (2013), ‘Untangling the computing landscape for NASA climate simulations’, http://www.nas.nasa.gov/SC12/demos/demo20.html
(13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
(14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), ‘ATLAS data management accounting with Hadoop Pig and HBase’