The document proposes a novel vector spatial data storage schema based on Hadoop to address problems with managing large-scale spatial data in cloud computing. It designs a vector spatial data storage scheme using column-oriented storage and key-value mapping to represent topological relationships. It also develops middleware to directly store spatial data and enable geospatial data access using the GeoTools toolkit. Experiments on a Hadoop cluster demonstrate the proposal is efficient and applicable for large-scale vector spatial data storage and expression of spatial relationships.
2. OBJECTIVE
• Cloud computing technology is changing how the spatial information industry is applied and provides new ideas for it.
• Since the Hadoop platform provides easy expansion, high performance, high fault tolerance and other advantages, we propose a novel vector spatial data storage schema based on it to solve the problem of how to use cloud computing technology to directly manage spatial data and present the topological relations of the data.
• Firstly, a vector spatial data storage schema is designed based on column-oriented storage structures and key/value mapping to express spatial topological relations.
• Secondly, we design middleware and merge it with the vector spatial data storage schema in order to directly store spatial data, and present geospatial data access refinement schemes based on the GeoTools toolkit (a GeoTools access sketch follows below).
• Thirdly, we verify the middleware and the data storage schema through Hadoop cluster experiments. Comprehensive experiments demonstrate that our proposal is efficient and applicable for directly storing large-scale vector spatial data and expressing spatial topological relations in a timely manner.
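To make the GeoTools-based access concrete, the following is a minimal, hypothetical sketch (not the paper's actual middleware) that uses the GeoTools toolkit to read vector features and their geometries from a local shapefile. The file name roads.shp and the class name are placeholders, and the import package names follow the common GeoTools layout, which varies between GeoTools releases.

// A minimal sketch of GeoTools-based vector feature access; not the paper's middleware.
// The shapefile path is a placeholder; the proposed middleware would read features
// from the Hadoop/HBase storage schema instead of a local shapefile.
import java.io.File;

import org.geotools.data.FileDataStore;
import org.geotools.data.FileDataStoreFinder;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.data.simple.SimpleFeatureSource;
import org.locationtech.jts.geom.Geometry;
import org.opengis.feature.simple.SimpleFeature;

public class VectorFeatureReader {
    public static void main(String[] args) throws Exception {
        // Open a local shapefile and get its feature source.
        FileDataStore store = FileDataStoreFinder.getDataStore(new File("roads.shp"));
        SimpleFeatureSource source = store.getFeatureSource();
        SimpleFeatureCollection features = source.getFeatures();

        // Iterate over the features; each one carries a geometry plus attribute values.
        try (SimpleFeatureIterator it = features.features()) {
            while (it.hasNext()) {
                SimpleFeature feature = it.next();
                Geometry geom = (Geometry) feature.getDefaultGeometry();
                System.out.println(feature.getID() + " -> " + geom.getGeometryType());
            }
        }
        store.dispose();
    }
}

In the proposed scheme, feature and geometry objects obtained this way would be handed to the middleware for serialization into the column-oriented, key/value storage schema rather than written to a shapefile.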
3. INDEX
INTRODUCTION
CLOUD COMPUTING
CLOUD PROPERTIES
CLOUD COMPUTING INFRASTRUCTURE
CLASSIFICATION OF CLOUD COMPUTING BASED ON SERVICE PROVIDED
WHAT IS HADOOP
HADOOP COMPONENTS
HADOOP DISTRIBUTED FILE SYSTEM
DATA STORAGE BASED ON HADOOP
HBASE DATABASE'S STORAGE MECHANISM
SPATIAL DATA
DESIGNING VECTOR SPATIAL DATA STORAGE SCHEME
VECTOR SPATIAL OBJECT MODEL
VECTOR SPATIAL DATA LOGICAL STORAGE
VECTOR SPATIAL DATA PHYSICAL STORAGE
DEVELOPING MIDDLEWARE BASED ON GEOTOOLS
EXPERIMENTAL RESULTS
CONCLUSIONS AND FUTURE WORK
REFERENCES
4. INTRODUCTION
• Spatial data is the basis of GIS applications.
• With the advancement of data acquisition techniques, large amounts of geospatial data have been collected from multiple data sources, such as satellite observations, remotely sensed imagery, aerial photography, and model simulations.
• Geospatial data are growing exponentially to the PB (Petabyte) scale and even the EB (Exabyte) scale. This presents a great challenge to traditional database storage, especially for vector spatial data with its complex structure, so traditional spatial database storage faces a series of problems such as poor scalability and low data storage efficiency.
• With its superior scalability and data storage efficiency, Hadoop, and large-scale distributed data management platforms in general, provides an efficient way to store large-scale vector spatial data.
• Many scholars have studied data storage based on cloud computing technology.
• Some have applied the MapReduce model to process spatial data, and have researched geospatial data storage and geospatial data indexing based on the Hadoop platform.
• Others have compared the Hadoop platform with the Oracle Spatial database for attribute data queries and concluded that Hadoop is more efficient in data query.
• Jifeng Cui et al. organized and stored heterogeneous geospatial data based on Google's GFS to solve the problem of multi-source geospatial data storage and query efficiency.
• A relevant challenge remains: how to use an unstructured database to directly store spatial data.
5. CLOUD COMPUTING
• What is the “cloud”?
• Easier to explain with examples:
• Gmail is in the cloud
• Amazon (AWS) EC2 and S3 are the cloud
• Google AppEngine is the cloud
• SimpleDB is in the cloud
• “Cloud computing is the delivery of computing as a service rather
than a product, whereby shared resources, software, and information
are provided to computers and other devices as a utility (like the
electricity grid) over a network (typically the Internet). “
6. CLOUD PROPERTIES
• Cloud offers:
• Scalability: you (can) have effectively infinite resources and can handle an unlimited number of users
• Reliability (hopefully!)
• Availability (24x7)
• Elasticity: you can add or remove compute nodes, and the end user will not be affected or will see the improvement quickly.
• Multi-tenancy: enables sharing of resources and costs across a large pool of users. Lower cost, higher utilization... but other issues, e.g. security.
7. CLOUD COMPUTING INFRASTRUCTURE
• Computation model: MapReduce* (partition the job, then reduce the partial results)
• Storage model: HDFS*
• Other computation models: HPC/Grid Computing
• Network structure
Types of cloud
• Public Cloud: Computing infrastructure is hosted at the vendor’s premises.
• Private Cloud: Computing architecture is dedicated to the customer and is not shared with other
organizations.
• Hybrid Cloud: Organizations host some critical, secure applications in private clouds; the less critical applications are hosted in the public cloud.
• Cloud bursting: the organization uses its own infrastructure for normal usage, but cloud is used for peak loads.
• Community Cloud
8. CLASSIFICATION OF CLOUD COMPUTING
BASED ON SERVICE PROVIDED
• Infrastructure as a service (IaaS)
• Offering hardware related services using the principles of cloud computing. These could include
storage services (database or disk storage) or virtual servers.
• Amazon EC2, Amazon S3, Rackspace Cloud Servers and Flexiscale.
• Platform as a Service (PaaS)
• Offering a development platform on the cloud.
• Google’s App Engine, Microsoft’s Azure, Salesforce.com’s Force.com.
• Software as a service (SaaS)
• Including a complete software offering on the cloud. Users can access a
software application hosted by the cloud vendor on pay-per-use basis. This
is a well-established sector.
• Salesforce.com’s offering in the online Customer Relationship Management
(CRM) space, Google’s Gmail and Microsoft’s Hotmail, Google Docs.
9. WHAT IS HADOOP??
• Hadoop is a software framework for distributed processing of large datasets
across large clusters of computers
• Large datasets: terabytes or petabytes of data
• Large clusters: hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google MapReduce
• Hadoop is based on a simple programming model called MapReduce
• Hadoop is based on a simple data model, any data will fit
• Download from hadoop.apache.org
• To install locally, unzip and set JAVA_HOME
• Details: hadoop.apache.org/core/docs/current/quickstart.html
• Three ways to write jobs (a minimal Java API sketch follows this list):
• Java API
• Hadoop Streaming (for Python, Perl, etc)
• Pipes API (C++)
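For illustration, here is a minimal word-count sketch written against the Hadoop Java API mentioned above. It assumes a Hadoop 2.x style Job API; the class name, token splitting and input/output paths are placeholders rather than anything used later in this document.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: emit (word, 1) for every token in the input line
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}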
10. HADOOP COMPONENTS
• Distributed file system (HDFS)
• Single namespace for entire cluster
• Replicates data 3x for fault-tolerance
• MapReduce framework
• Executes user jobs specified as “map” and “reduce” functions
• Manages work distribution & fault-tolerance
11. HADOOP DISTRIBUTED FILE SYSTEM
• Files split into 128MB blocks
• Blocks replicated across several
datanodes (usually 3)
• Single namenode stores metadata
(file names, block locations, etc)
• Optimized for large files,
sequential reads
• Files are append-only
[Figure: HDFS block placement. The NameNode holds the metadata while blocks 1-4 are replicated across the DataNodes.]
12. DISTRIBUTED FILE SYSTEM HDFS
• HDFS, the Hadoop Distributed File System, is the main storage component of the
Hadoop distributed platform and is shown in the figure below. It stores data in blocks.
• Each block is the same size; the default size is 64 MB.
• An HDFS cluster consists of a NameNode and a number of DataNodes.
• Clients first read metadata information through the NameNode, and then access the
data through the appropriate DataNode.
• When clients store data in the DataNodes, the NameNode records the metadata
information for that data (a client-side sketch follows the figure).
[Figure: The Architecture of HDFS [9]. The NameNode records metadata information (directory, file name, location, number of copies); clients read metadata from the NameNode and read/write data blocks on the DataNodes.]
13. DATA STORAGE BASED ON HADOOP
• Hadoop, and large-scale distributed data processing in general, is
rapidly becoming an important skill set for many programmers.
• Its core consists of the distributed file system HDFS, the distributed
unstructured database HBase, and the distributed parallel computing framework MapReduce.
• The key distinctions of Hadoop are that it is accessible, robust and scalable.
• Today, Hadoop is a core part of the computing infrastructure for many
web companies, such as Yahoo, Facebook, LinkedIn, and Twitter.
Many more traditional businesses, such as media and telecom, are
beginning to adopt this system, too.
14. HBASE DATABASE’S STORAGE MECHANISM
• It can use the local file system and HDFS.
• But HBase improves its data processing ability, data reliability and the robustness of the system if
it uses HDFS as its file system.
• HBase stores data on disk in a column-oriented format, and it is distinctly different from traditional columnar
databases:
• HBase excels at providing key-based access to a specific cell of data, or a sequential range of cells.
• Each row of the same table can have very different columns, and each column value carries a time version called
a “timestamp”.
• The timestamp records each update of the database and indicates the version of a value.
• The logical view of the HBase database has two column families, c1 and c2, as shown in Table 1.
• Each row of data expresses its updates through timestamps.
• Each column family is saved in several files, and different column families are stored separately.
• This feature distinguishes it from a traditional row-oriented database.
• In a row-oriented system, index tables are built as additional structures to get fast query results.
• The HBase database does not need additional storage for data indexes, because data is stored together with
its indexes.
• Data loading is executed faster in a column-oriented system than in a row-oriented one.
• All data of a row is stored together in a row-oriented database, and all data of a column is stored together in a
column-oriented database.
• A column-oriented system loads all columns in parallel to reduce data loading time (a short code sketch follows this list).
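A small hedged sketch of the key-based, timestamped cell access described above, written against the older HBase client API (HTable, Put.add) that the code later in this document also uses. The table name demoTable is a placeholder; the r1 / c1 names mirror Table 1.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demoTable");   // assumes the table and family c1 already exist

        // Two writes to the same cell produce two timestamped versions
        Put put1 = new Put(Bytes.toBytes("r1"));
        put1.add(Bytes.toBytes("c1"), Bytes.toBytes("1"), Bytes.toBytes("old value"));
        table.put(put1);
        Put put2 = new Put(Bytes.toBytes("r1"));
        put2.add(Bytes.toBytes("c1"), Bytes.toBytes("1"), Bytes.toBytes("Value1"));
        table.put(put2);

        // Key-based access to a specific cell: only the requested column is read
        Get get = new Get(Bytes.toBytes("r1"));
        get.addColumn(Bytes.toBytes("c1"), Bytes.toBytes("1"));
        get.setMaxVersions(3);   // also return older timestamped versions of the cell
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("c1"), Bytes.toBytes("1"))));
        table.close();
    }
}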
15. HBASE DATABASE’S STORAGE
MECHANISM(CONTD.)
Table 1. HBase Storage Data Logic View

RowKey   TimeStamp   Column Family: c1       Column Family: c2
                     info     value          Attribute   value
r1       t6          c1:1     Value1
         t5          c1:2     Value2         c2:1        Value1
         t4                                  c2:2        Value2
r2       t3
         t2          c1:1     Value1
         t1          c1:2     Value2
[Figure: The relationship between HBase and HDFS. Hadoop MapReduce and HBase run on top of HDFS, coordinated by ZooKeeper.]
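The layering in the figure also shows up in a client configuration. A minimal sketch using standard HBase configuration properties; the host names and path are placeholder assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseOnHdfsConfig {
    public static Configuration build() {
        Configuration conf = HBaseConfiguration.create();
        // HBase persists its data in HDFS (placeholder NameNode address)
        conf.set("hbase.rootdir", "hdfs://namenode-host:9000/hbase");
        // ZooKeeper coordinates the HBase region servers (placeholder quorum)
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        return conf;
    }
}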
16. DESIGNING VECTOR SPATIAL DATA
STORAGE SCHEMA
• Vector data model: a representation of the world using points, lines, and
polygons.
• Vector models are useful for storing data that has discrete boundaries, such
as country borders, land parcels, and streets.
• Vector data organization is more complex than raster data (a grid of cells; raster
models are useful for storing data that varies continuously, such as a photograph,
a satellite image, or a surface of chemical concentrations), because it not only
considers scale, layers, points, lines, surfaces and other factors but also involves
“complex spatial topological relations”.
• We should design a vector data storage schema that fits the Hadoop
distributed platform in order to take advantage of Hadoop storage.
• This schema should offer efficient organization and complete storage
of vector spatial data in the unstructured database platform.
17. WHAT IS SPATIAL DATA ??
• Spatial: refers to space.
• Spatial data refers to all types of data objects or elements that are present in a
geographical space or horizon.
• It enables the global finding and locating of individuals or devices anywhere in
the world.
• Spatial data is also known as geospatial data, spatial information or geographic
information.
• Spatial data is used in geographical information systems (GIS) and other
geolocation or positioning services.
• Spatial data consists of points, lines, polygons and other geographic and geometric
primitives, which can be mapped by location, stored with an object as metadata, or
used by a communication system to locate user devices.
• Spatial data may be classified as “scalar or vector data”.
18. VECTOR SPATIAL OBJECT MODEL
• The OGC simple feature model, proposed by the Open Geospatial Consortium (formerly the
Open GIS Consortium) to share geospatial information and geospatial services, is shown in the figure.
• We use the OGC simple feature model to design the vector spatial object model in
order to achieve better interoperability of heterogeneous spatial databases (a small geometry sketch follows).
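As a small illustration of the point, line and polygon geometry types in the simple feature model, here is a hedged sketch using the JTS geometry library bundled with GeoTools (in the GeoTools 2.x era it lives under com.vividsolutions.jts); the coordinates are arbitrary placeholders.

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.LineString;
import com.vividsolutions.jts.geom.Point;
import com.vividsolutions.jts.geom.Polygon;

public class SimpleFeatureGeometries {
    public static void main(String[] args) {
        GeometryFactory gf = new GeometryFactory();

        // Point
        Point point = gf.createPoint(new Coordinate(116.3, 39.9));

        // LineString
        LineString line = gf.createLineString(new Coordinate[] {
            new Coordinate(0, 0), new Coordinate(1, 1), new Coordinate(2, 0) });

        // Polygon (closed ring, no holes)
        Polygon polygon = gf.createPolygon(gf.createLinearRing(new Coordinate[] {
            new Coordinate(0, 0), new Coordinate(4, 0), new Coordinate(4, 4),
            new Coordinate(0, 4), new Coordinate(0, 0) }), null);

        // Well-Known Text is a convenient string form for storing geometries as cell values
        System.out.println(point.toText());
        System.out.println(line.toText());
        System.out.println(polygon.toText());
    }
}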
19. VECTOR SPATIAL DATA LOGICAL STORAGE
• Vector data consists of coordinate data, attribute data, and topology data.
• We designed vector spatial data storage schema based on the HBase database
storage model according to the characteristics of vector data.
• The vector spatial data logical storage schema is shown in Table 2. It contains
three column families, coordinate, attribute and topology, respectively recording
the coordinate information, attribute information and topology information of the data.
• Each data type in the storage system is a string and is parsed into the appropriate
data type in accordance with Table 3 (the dictionary storage structure of
vector data types).
• We can carefully design the “RowKey” in accordance with the actual situation
and usage scenarios to obtain the desired collection of results in a query with
good performance.
• The variable “tableName” represents the table’s name and the variable
familys contains the column families of a table:
20. Vector Spatial Data Logical Storage(Contd.)
public static void creatTable(String tableName, String[] familys) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(conf);
    if (admin.tableExists(tableName)) {
        System.out.println("table already exists!");
    } else {
        HTableDescriptor tableDesc = new HTableDescriptor(tableName);
        for (int i = 0; i < familys.length; i++) {
            tableDesc.addFamily(new HColumnDescriptor(familys[i]));
        }
        admin.createTable(tableDesc);
    }
}
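For example, the table used by the logical schema in Table 2 could then be created with a call like the following; the table name "VectorTable" is an assumed placeholder, while the family names follow Table 2.

// Create a vector table with the three column families of Table 2
creatTable("VectorTable", new String[] { "coordinate", "attribute", "topology" });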
21. Vector Spatial Data Logical Storage(Contd.)
Table 2. Vector Spatial Data Storage View

RowKey    TimeStamp   Column Family: Coordinate   Column Family: Attribute   Column Family: Topology
                      Info     value              Attribute     value        Topology   value
Fea_ID1   T8          Info:1   (x, y)
          T7          Info:2   coordinate
          T6                                      Attribute:1   Value1
          T5                                      Attribute:2   Value2
          T4
          T3                                                                 Topo:1     Value1
          T2          Info:1   (x, y)
          T1          Info:2   coordinate

Table 3. The Dictionary Storage Structure for Vector Data Types

TimeStamp   Column Family: Coordinate   Column Family: Attribute   Column Family: Topology
            info           Value        Attribute       value      Topology   value
T8          "(x, y)"       "double"
T7          "Coordinate"   "string"
T6                                      "Attribute1"    "int"
T5                                      "Attribute2"    "string"
T3                                                                 "Topo1"    "int"
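A hedged sketch of writing one feature row with the three column families of Table 2, again using the older HBase client API. The row key Fea_ID1 follows Tables 4-6; the qualifier names follow the labels in Table 2, and the WKT value is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutVectorFeature {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "VectorTable");

        Put put = new Put(Bytes.toBytes("Fea_ID1"));
        // Coordinate column family: the geometry serialized as a string
        put.add(Bytes.toBytes("coordinate"), Bytes.toBytes("Info:1"),
                Bytes.toBytes("LINESTRING (0 0, 1 1, 2 0)"));
        // Attribute column family: a feature attribute value
        put.add(Bytes.toBytes("attribute"), Bytes.toBytes("Attribute:1"), Bytes.toBytes("Value1"));
        // Topology column family: a reference to a related feature
        put.add(Bytes.toBytes("topology"), Bytes.toBytes("Topo:1"), Bytes.toBytes("Value1"));
        table.put(put);
        table.close();
    }
}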
22. VECTOR SPATIAL DATA PHYSICAL STORAGE
• HBase stores data on disk in a column-oriented format, although the logical view
consists of many rows.
• The physical storage of the row whose RowKey is Fea_ID1 in Table 2 is shown in
Table 4, Table 5 and Table 6.
• From these tables it can be concluded that the blank cells in the logical view of Table 2
are not actually stored in the physical model.
• This differs from a “relational database” when we design the data storage model and
develop the program.
• In the HBase database, we do not need to build additional indexes, because data is stored
together with its indexes.
• In data queries, the vector spatial data storage model based on the HBase database reads only
the columns required by a query (see the sketch after this list).
• This query style provides better performance for analytical requests.
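The column-oriented read described above can be expressed by restricting a Scan to one column family, so only the files of that family are touched. A minimal hedged sketch; the table and family names follow the schema above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCoordinateFamily {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "VectorTable");

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("coordinate"));   // read only the coordinate column family
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()));
        }
        scanner.close();
        table.close();
    }
}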
23. Vector Spatial Data Physical Storage(Contd.)
Table 4. The Physical Storage of the Coordinate Column

Row Key   Time Stamp   Column Family: Coordinate
                       info       value
Fea_ID1   T8           Info:1     (x, y)
          T7           Info:2     coordinate

Table 5. The Physical Storage of the Attribute Column

Row Key   Time Stamp   Column Family: Attribute
                       Attribute       value
Fea_ID1   T6           "Attribute1"    "int"
          T5           "Attribute2"    "string"

Table 6. The Physical Storage of the Topology Column

Row Key   Time Stamp   Column Family: Topology
                       Topology   value
Fea_ID1   T3           Topo:1     Value1
24. Developing Middleware Based on GeoTools
Commercial GIS (Geographic Information System) software is expensive.
GeoTools is an open source GIS toolkit developed as free software in the Java language.
It contains many open source GIS projects and standards-based GIS interfaces, provides many
GIS algorithms, and performs well when reading and writing various data formats.
In this experiment, we use the GeoTools 2.7.5 open source project to read shapefile data on the client
and make the appropriate conversion, then use the “put()” method to import the data into the HBase database
via the DataStore, FeatureSource and FeatureCollection class libraries, as sketched below.
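A hedged sketch of this import step under a few assumptions: it uses GeoTools' generic shapefile DataStore API (FileDataStoreFinder) rather than the exact middleware classes, stores the geometry as WKT text in the coordinate column family, and the file path and qualifier name are placeholders.

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.geotools.data.FileDataStoreFinder;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.data.simple.SimpleFeatureSource;
import org.opengis.feature.simple.SimpleFeature;
import com.vividsolutions.jts.geom.Geometry;

public class ShapefileToHBase {
    public static void main(String[] args) throws Exception {
        // Read the shapefile on the client side with GeoTools (placeholder path)
        SimpleFeatureSource source =
            FileDataStoreFinder.getDataStore(new File("/data/roads.shp")).getFeatureSource();
        SimpleFeatureCollection features = source.getFeatures();

        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "VectorTable");

        SimpleFeatureIterator it = features.features();
        try {
            while (it.hasNext()) {
                SimpleFeature feature = it.next();
                Geometry geom = (Geometry) feature.getDefaultGeometry();
                Put put = new Put(Bytes.toBytes(feature.getID()));   // feature ID as the RowKey
                put.add(Bytes.toBytes("coordinate"), Bytes.toBytes("Info:1"),
                        Bytes.toBytes(geom.toText()));               // geometry stored as WKT
                table.put(put);
            }
        } finally {
            it.close();
            table.close();
        }
    }
}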
We design middleware around these methods and the vector spatial data storage schema to access
and display the vector spatial data based on the GeoTools toolkit.
According to the HBase database query mechanism, we use the “get()” and “scan()” methods to search data
from the database. The get() method acquires a single record, and the scan() method performs range queries on
spatial data by limiting setStartRow() and setStopRow().
// Get(): reading a single record
HTable table = new HTable(conf, tableName);
Get get = new Get(rowKey.getBytes());
Result rs = table.get(get);
byte[] ret = rs.getValue(Bytes.toBytes(family), Bytes.toBytes(column));

// Scan(): range queries
HTable table = new HTable(conf, tableName);
Scan s = new Scan();
s.setStartRow(startRow);
s.setStopRow(stopRow);
ResultScanner rs = table.getScanner(s);
25. Experimental Results
In the Hadoop client, we use the scan() method to query the
J48E023023 road layer data from 1:50000 vector data.
To complete the query, the Hadoop platform takes
1.26 seconds, while the Oracle Spatial platform takes 1.34
seconds.
Because the Hadoop platform is designed to manage large files,
it suffers a performance penalty when managing a large
number of small files.
It can be seen that the efficiency of data storage is not very
high here because the amount of data is too small.
The HBase database is intended to manage massive spatial data
efficiently, and nodes can be added to obtain more storage space
and improve computational efficiency.
Finally, the middleware uses the Feature, FeatureBuilder,
FeatureCollection and ShapefileDataStore class libraries to
create a shapefile from the data that has been read, so that the
middleware can display it (a sketch follows the figure).
[Figure: Road layer data displayed by the middleware]
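A hedged sketch of the shapefile-creation step, following GeoTools' standard shapefile-writing pattern (SimpleFeatureTypeBuilder, SimpleFeatureBuilder, ShapefileDataStoreFactory). The feature type, attribute values and output path are placeholders, and exact package locations can differ between GeoTools versions.

import java.io.File;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import org.geotools.data.DefaultTransaction;
import org.geotools.data.Transaction;
import org.geotools.data.shapefile.ShapefileDataStore;
import org.geotools.data.shapefile.ShapefileDataStoreFactory;
import org.geotools.data.simple.SimpleFeatureStore;
import org.geotools.feature.DefaultFeatureCollection;
import org.geotools.feature.simple.SimpleFeatureBuilder;
import org.geotools.feature.simple.SimpleFeatureTypeBuilder;
import org.geotools.referencing.crs.DefaultGeographicCRS;
import org.opengis.feature.simple.SimpleFeatureType;
import com.vividsolutions.jts.geom.LineString;
import com.vividsolutions.jts.io.WKTReader;

public class HBaseToShapefile {
    public static void main(String[] args) throws Exception {
        // Define the feature type of the road layer (name and fields are placeholders)
        SimpleFeatureTypeBuilder typeBuilder = new SimpleFeatureTypeBuilder();
        typeBuilder.setName("road");
        typeBuilder.setCRS(DefaultGeographicCRS.WGS84);
        typeBuilder.add("the_geom", LineString.class);
        typeBuilder.add("name", String.class);
        SimpleFeatureType type = typeBuilder.buildFeatureType();

        // Build one feature from a WKT string, as it would be read back from HBase
        SimpleFeatureBuilder featureBuilder = new SimpleFeatureBuilder(type);
        featureBuilder.add(new WKTReader().read("LINESTRING (0 0, 1 1, 2 0)"));
        featureBuilder.add("J48E023023");
        DefaultFeatureCollection collection = new DefaultFeatureCollection();
        collection.add(featureBuilder.buildFeature("Fea_ID1"));

        // Create the shapefile and write the collection into it (placeholder output path)
        File file = new File("/data/roads_out.shp");
        Map<String, Serializable> params = new HashMap<String, Serializable>();
        params.put("url", file.toURI().toURL());
        ShapefileDataStore store =
            (ShapefileDataStore) new ShapefileDataStoreFactory().createNewDataStore(params);
        store.createSchema(type);

        Transaction tx = new DefaultTransaction("create");
        SimpleFeatureStore featureStore =
            (SimpleFeatureStore) store.getFeatureSource(store.getTypeNames()[0]);
        featureStore.setTransaction(tx);
        try {
            featureStore.addFeatures(collection);
            tx.commit();
        } finally {
            tx.close();
        }
    }
}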
26. CONCLUSION AND FUTURE WORK
We analyze the HDFS distributed file system and the HBase distributed database storage
mechanism, and offer a vector spatial data storage schema based on the Hadoop open
source distributed cloud storage platform.
Finally, we design middleware that works with the vector spatial data storage schema and
verify the effectiveness and availability of the vector spatial data storage schema
through the experiment.
This also provides an effective way to store large-scale vector spatial data, and is useful for the
many companies committed to studying Hadoop for large-scale data storage.
Theoretically, according to the Hadoop data storage strategy, we overcome the poor
scalability, low efficiency and other problems of traditional relational databases and
provide practically unlimited storage space and high read and write performance for large-
scale spatial data.
We should still design an “excellent spatial data partition strategy and build a distributed
spatial index structure with high performance to efficiently manage large-scale spatial
data”.
Future work should therefore look at the spatial data partition strategy and a distributed spatial
index structure, with the goal of further enhancing data management effectiveness.
27. REFERENCES
• Y. Zhong, J. Han, T. Zhang and J. Fang, “A distributed geospatial data storage and processing framework for large-scale WebGIS”, 20th
International Conference on Geoinformatics (GEOINFORMATICS), Hong Kong China, (2012) June 15-17.
• S. Sakr, A. Liu, D. M. Batista and M. Alomari, “A Survey of Large Scale Data Management Approaches in Cloud Environments”,
Communications Surveys & Tutorials, vol. 13, no. 3, (2011), pp. 311-336.
• X. H. Liu, J. Han and Y. Zhong, “Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS”,
IEEE International Conference on Cluster Computing and Workshops, New Orleans, Louisiana, (2009) August 31- September 4, pp.
1- 8.
• D.-W. Zhang, F.-Q. Sun, X. Cheng and C. Liu, “Research on hadoop-based enterprise file cloud storage system”, 3rd International
Conference on in Awareness Science and Technology (iCAST), Dalian China, (2011) September 27-30, pp. 434-437.
• A. Cary, Z. Sun, V. Hristidis and N. Rishe, “Experiences on Processing Spatial Data with MapReduce”, The 21st International
Conference on Scientific and Statistical Database Management, New Orleans, LA, USA, (2009) June 02-04, pp. 1-18.
• Y. Gang Wang and S. Wang, “Research and Implementation on Spatial Data Storage and Operation Based on Hadoop Platform”,
Second IITA International Conference on Geoscience and Remote Sensing, Qingdao China, (2010) August 28-31, pp. 275-278.
• J. Cui, C. Li, C. Xing and Y. Zhang, “The framework of a distributed file system for geospatial data management”, Proceedings of IEEE
CCIS, (2011), pp. 183-187.
• C. Lam, “Hadoop in Action”, Manning Publications, (2010).
• K. Shvachko, K. Hairong, S. Radia and R. Chansler, “The Hadoop Distributed File System”, IEEE 26th Symposium on Mass Storage
Systems and Technologies (MSST), Washington, DC: IEEE Computer Society, (2010) May 3-7, pp. 1-10
• D. Borthakur, “The Hadoop Distributed File System: Architecture and design”, (2008).
• L. George, “HBase: The Definitive Guide”, Published by O’Reilly Media: First Edition, (2011) September.
• OGC, http://www.opengeospatial.org/standards, (2012) April.