G-Store: High-Performance Graph Store for Trillion-Edge Processing
1. G-Store: High-Performance Graph Store for Trillion-Edge Processing
Pradeep Kumar and H. Howie Huang
The George Washington University
2. Introduction
Graphs are everywhere
Applications include
• Social Networks
• Computer Networks
• Protein Structures
• Recommendation Systems
• And many more …
Graph sizes are reaching a trillion edges
10. Tile Advantages
G-Store achieves up to 8x space saving:
• 2x due to symmetry
• Up to 4x due to the Smallest Number of Bits (SNB) representation
Processing can run directly on top of the SNB format, with no need for conversion:
• Uses a new relative pointer per tile for each piece of algorithmic metadata
• Please refer to the paper for details
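To make the SNB format concrete, here is a minimal C++ sketch (an illustration under assumed parameters, not G-Store's actual code): each endpoint of an edge is stored as a 16-bit tile-local offset, and the global vertex IDs are recovered from the tile's row and column in the tile grid. The tile width and all names are assumptions.

```cpp
#include <cstdint>

// Assumed tile width: each tile covers a 2^16 x 2^16 range of vertex IDs.
constexpr uint64_t TILE_BITS = 16;
constexpr uint64_t TILE_MASK = (1u << TILE_BITS) - 1;

// An edge stored inside a tile: 4 bytes, versus 16 bytes for two 64-bit
// global IDs, which is where an up-to-4x saving can come from.
struct SnbEdge {
    uint16_t src_off;   // source offset within the tile's row range
    uint16_t dst_off;   // destination offset within the tile's column range
};

// Encode a global edge (u, v) into tile-local form.
inline SnbEdge encode(uint64_t u, uint64_t v) {
    return { static_cast<uint16_t>(u & TILE_MASK),
             static_cast<uint16_t>(v & TILE_MASK) };
}

// Decode back to global IDs, given the tile's position (i, j) in the grid.
// Algorithms can index per-tile metadata with the offsets directly, so
// decoding is only needed at the boundary of the computation.
inline void decode(SnbEdge e, uint64_t i, uint64_t j,
                   uint64_t& u, uint64_t& v) {
    u = (i << TILE_BITS) | e.src_off;
    v = (j << TILE_BITS) | e.dst_off;
}
```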
12. Tile Properties
Tile sizes of the Twitter graph
• Follow a power-law distribution
Tile metadata size
• All smaller than the LLC

Algorithm              Metadata Size
PageRank               256KB
Connected Components   256KB
BFS                    64KB
13. Grouping and On-disk Layout
Combines q*q tiles into a physical group
Layout of tiles on disk
• Reading tiles sequentially provides better hit ratios on the LLC
A tile is the basic unit for data fetching, processing, and caching
Uses Linux AIO (a minimal sketch follows the figure below)
[Figure: disk layout of tiles in a physical group. The tile grid is divided into g = p/q physical groups of q*q tiles each; per-vertex algorithmic metadata (e.g., PageRank values) is laid out alongside the groups.]
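Since the deck does not show the fetch path, here is a hedged libaio sketch of reading one physical group asynchronously (the file name, group size, and queue depth of one are assumptions; the real system overlaps many in-flight reads with processing, as the slide-cache-rewind slide shows):

```cpp
// Build with: g++ -O2 aio_fetch.cpp -laio
#include <libaio.h>
#include <fcntl.h>
#include <cstdlib>
#include <cstdio>

int main() {
    const size_t GROUP_BYTES = 64u << 20;            // assumed physical-group size
    int fd = open("tiles.grp", O_RDONLY | O_DIRECT); // O_DIRECT: bypass page cache
    if (fd < 0) { perror("open"); return 1; }

    void* buf = nullptr;
    posix_memalign(&buf, 4096, GROUP_BYTES);         // O_DIRECT needs aligned buffers

    io_context_t ctx = 0;
    io_setup(1, &ctx);                               // queue depth 1 for this sketch

    struct iocb cb;
    struct iocb* cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, GROUP_BYTES, 0);     // read the group at offset 0
    io_submit(ctx, 1, cbs);                          // returns immediately

    // ... process the previously fetched group here, overlapping IO and compute ...

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, nullptr);           // wait for the read to finish
    io_destroy(ctx);
    free(buf);
    return 0;
}
```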
15. Proactive Caching
Divides memory into a streaming region and a caching region
Rule 1: at the end of processing row[i]
• We know whether row[i] will be processed in the next iteration
• This hint can be used later
Rule 2: if row[i] is not needed in the next iteration
• Then we know that Tile[i,i] will not be needed either
• We only have partial information about other individual tiles
[Figure: example tile grid. Tile[0,0] holds edges (0,1), (0,3), (1,2); Tile[0,1] holds (0,4), (1,4), (2,4); Tile[1,1] holds (4,5), (5,6), (5,7). Row[0] spans Tile[0,0] and Tile[0,1]; Row[1] spans Tile[1,1].]
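Read together, the two rules act as a small eviction oracle. A hedged sketch with illustrative names (this is not G-Store's API):

```cpp
#include <vector>

// Hedged sketch of the two caching rules above (names are assumptions).
struct CacheHints {
    std::vector<bool> active_next;   // per-row hint, filled in by Rule 1

    explicit CacheHints(int rows) : active_next(rows, true) {}

    // Rule 1: at the end of processing row[i] we learn whether it will run
    // again in the next iteration; record that hint for later use.
    void on_row_done(int i, bool will_run_next_iter) {
        active_next[i] = will_run_next_iter;
    }

    // Rule 2: the diagonal tile [i,i] is touched only by row i's partition,
    // so an inactive row lets us evict it with certainty. For an off-diagonal
    // tile [i,j] the information is only partial: it may still be needed
    // unless both row i and row j are known to be inactive.
    bool certainly_unneeded(int i, int j) const {
        if (i == j) return !active_next[i];
        return !active_next[i] && !active_next[j];
    }
};
```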
17. Slide-cache-rewind
[Figure: slide-cache-rewind pipeline. While the IO stage (S1) fetches tiles, the processing stage (S2) runs on previously fetched tiles and the cache pool holds useful data; proactive caching keeps the tail tiles (Tn) of iteration T, and iteration T+1 rewinds to reuse them before streaming new data.]
• Less IO due to reuse
• Provides hints for the proactive caching policy
• Evicts unwanted data from the cache pool
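A toy sketch of the rewind idea (illustrative only, with assumed group counts): the tail groups of iteration T survive in the cache pool, and iteration T+1 starts from them before streaming the rest, so those bytes are not re-read from disk.

```cpp
#include <cstdio>
#include <deque>
#include <set>

int main() {
    const int GROUPS = 6;       // assumed number of physical groups
    const int POOL   = 2;       // assumed cache-pool capacity, in groups
    std::deque<int> pool;       // group ids currently resident in the pool

    for (int iter = 0; iter < 2; ++iter) {
        std::set<int> cached(pool.begin(), pool.end());
        // Rewind: start the iteration from what is already in memory.
        for (int g : pool)
            printf("iter %d: group %d processed from cache (no IO)\n", iter, g);
        // Stream the remaining groups; the most recently read ones stay cached.
        for (int g = 0; g < GROUPS; ++g) {
            if (cached.count(g)) continue;
            printf("iter %d: group %d read from disk and processed\n", iter, g);
            pool.push_back(g);
            if ((int)pool.size() > POOL) pool.pop_front();
        }
    }
    return 0;
}
```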
18. Experimental Setup
Dual Intel Xeon E5-2683 CPUs, 14 cores each, hyper-threading enabled (56 threads)
8GB of streaming and caching memory
Cache
• 32KB L1 data and instruction caches
• 256KB L2 cache
• 16MB L3 cache
LSI SAS9300-8i HBA, 8x Samsung 850 EVO 512GB SSDs
Linux software RAID0 with a 64KB stripe size
20. Performance on a Trillion-Edge Graph
Kron-31-256: 2^31 vertices with 2^40 (≈1.1 trillion) edges
Kron-33-16: 2^33 vertices with 2^38 edges
One iteration of PageRank:
• G-Store: 14 min on a single machine for a trillion-edge graph
• Ching et al.*: 3 min using 200 machines for a similar-sized graph
*Ching et al. One Trillion Edges: Graph Processing at Facebook-Scale. Proceedings of the 41st International Conference on Very Large Data Bases (VLDB '15)
Runtime in seconds:
Graph         BFS       PageRank   WCC
Kron-31-256   2548.54   4214.54    1925.13
Kron-33-16    1509.13   1882.88    849.05
21. Performance Comparison
Over X-Stream
• 17x (BFS), 21x (PageRank), 32x (CC/WCC) speedup for Kron-28-16
• Other graphs are similar; see the paper for details
[Chart: speedup of G-Store relative to FlashGraph for BFS, PageRank, and CC/WCC]
Over FlashGraph
• 1.5x (CC/WCC), 2x (PageRank), 1.4x (BFS, undirected)
• Slightly worse for BFS on directed graphs, due to no space saving on the smaller graph
22. Scalability on SSDs
RAID0: 64KB stripe size
Close to 4x speedup with 4 SSDs
Up to 6x speedup with 8 SSDs
• PageRank becomes compute-intensive in the 8-SSD configuration
[Chart: speedup for BFS, PageRank, and WCC with 1, 2, 4, and 8 SSDs]
24. Cache Size
For the Kron-28-16 graph, growing the cache from 1GB to 8GB:
• Average 30% speedup
For the Twitter graph, growing the cache from 1GB to 4GB:
• Average 41% speedup
[Chart: speedup for BFS, PageRank, and WCC on Kron-28-16 and Twitter with 1GB, 2GB, 4GB, and 8GB cache sizes]
25. Conclusion
SNB: space-efficient representation
Grouping: optimal utilization of the hardware cache
Slide: complete overlap of IO and compute
Cache: graph-specific proactive caching policy
Rewind: using the last drop of memory
26. Thank You
Email: pradeepk@gwu.edu and howie@gwu.edu
Graph software repository
• https://github.com/iHeartGraph/
• G-Store: High-Performance Graph Store for Trillion-Edge Processing (SC'16)
• Enterprise: Breadth-First Graph Traversal on GPUs (SC'15)
• iBFS: Concurrent Breadth-First Search on GPUs (SIGMOD'16)