The document discusses storing terrestrial LiDAR data in a spatial database framework. It describes setting up a PostgreSQL database with a PostGIS extension to store LiDAR point cloud data in a hierarchical folder structure based on survey dates and locations. Issues with large data uploads are addressed through experiments comparing the PostgreSQL COPY method to the pg_bulkload method, finding pg_bulkload significantly faster for importing large LiDAR datasets. The spatial database allows efficient querying of LiDAR data by location or other attributes.
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput... (npinto)
This document summarizes a presentation about using CUDA (Compute Unified Device Architecture) to accelerate lattice quantum chromodynamics (QCD) calculations. CUDA is used to parallelize computations across many GPU threads. Each thread processes one lattice site, with neighboring sites and links accessed sequentially. Initially, each thread required 1.4KB of local storage, limiting occupancy. Occupancy was improved by storing data in registers instead of shared memory and explicitly unrolling loops. This achieved up to 82 gigabytes per second on a GTX 280, 20 times faster than CPUs. Memory access patterns, float4 arrays, and textures were optimized to improve bandwidth utilization.
This document summarizes a project report on implementing AES encryption in parallel. It describes how AES works sequentially and the approach taken to parallelize it by assigning each processing element a portion of the data to encrypt in parallel. Experimental results show speedups from parallelization and analysis of running times for different file sizes and numbers of processing elements. Future work is proposed to make the program more space efficient and properly recover the ciphertext.
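The data-parallel scheme described above can be sketched in a few lines. Note that this is a toy model: the per-block transform below is a plain XOR stand-in, not real AES, and the names `encrypt_block` and `parallel_encrypt` are illustrative, not taken from the report.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # AES block size in bytes

def encrypt_block(block: bytes, key: bytes) -> bytes:
    # Stand-in for the AES block transform: a simple XOR with the key.
    # Real AES substitutes and permutes the state over multiple rounds.
    return bytes(b ^ k for b, k in zip(block, key))

def parallel_encrypt(data: bytes, key: bytes, workers: int = 4) -> bytes:
    # Zero-pad to whole blocks, then hand each worker its own blocks,
    # mirroring "one portion of the data per processing element".
    data = data + bytes(-len(data) % BLOCK)
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        out = pool.map(lambda blk: encrypt_block(blk, key), blocks)
    return b"".join(out)  # map preserves block order
```

Because ECB-style encryption has no dependencies between blocks, the work splits cleanly; a real implementation would use processes or SIMD rather than Python threads to get actual speedup.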
The document discusses how the shape of object graphs in the heap can impact the scalability of garbage collection on highly parallel systems. Deep and narrow object graphs, where most objects have a high depth, can cause poor load balancing and low processor utilization during parallel tracing. The document analyzes the depth and shape of object graphs for several Java benchmarks and finds some have very deep and narrow graphs that could limit scalability. It proposes two solutions to improve scalability by reshaping object graphs or modifying parallel tracing.
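The depth analysis the document describes amounts to a breadth-first traversal from the roots. A minimal sketch, using a dict-of-lists as a toy heap model (the function name is illustrative):

```python
from collections import deque

def depth_profile(graph, roots):
    """BFS from the roots; return {depth: object_count}.

    `graph` maps each object id to the ids it references. Deep, narrow
    shapes show up as many depths with few objects each, which starves
    parallel tracers of work to steal.
    """
    depth = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    profile = {}
    for d in depth.values():
        profile[d] = profile.get(d, 0) + 1
    return profile
```

A linked list yields one object per depth level (the pathological case for parallel tracing), while a wide tree yields many objects at each level.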
MySQL replication is the backbone of the web economy, but it has shortcomings. Tungsten Replicator, an open source replication engine, takes MySQL replication to the next level with multiple masters, seamless failover, and parallel replication.
TROSGi is a framework that enhances OSGi with real-time Java support through three integration levels (L0, L1, L2). L0 provides minimal access to real-time Java. L1 adds a real-time characterization service. L2 introduces admission control, fault tolerance, and composition services. The services were implemented and tested on a real-time Java prototype, showing improved performance over non-real-time approaches. Ongoing work focuses on further implementation improvements and integrating other languages.
Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues, aka Column Stride Fields, is one of the "next generation" features.
The document discusses caching and new features in Ehcache 2 and Hibernate caching. It describes reasons for caching like offloading resources, improving performance, and enabling scalability. It discusses how caching works by leveraging principles of locality of reference and Pareto distributions. It also covers challenges like data size, staleness, and maintaining coherency in a clustered environment.
This document discusses cache and concurrency considerations for Apache Cassandra. It covers metrics and monitors for cache performance, how the JVM performs in big data systems, examples of Cassandra in real-world systems like Facebook and Twitter, techniques for achieving fast writes and reads, and tools for optimizing performance. It emphasizes locality, non-blocking collections, and techniques for handling garbage collection and compactions efficiently.
Introduction of Java GC Tuning and Java Mission Control (Leon Chen)
This document provides an introduction and overview of Java garbage collection (GC) tuning and the Java Mission Control tool. It begins with information about the speaker, Leon Chen, including his background and patents. It then outlines the Java and JVM roadmap and upcoming features. The bulk of the document discusses GC tuning concepts like heap sizing, generation sizing, footprint vs throughput vs latency. It provides examples and recommendations for GC logging, analysis tools like GCViewer and JWorks GC Web. The document is intended to outline Oracle's product direction and future plans for Java GC tuning and tools.
This document proposes an extension to the Real-Time Specification for Java (RTSJ) called RealtimeThread++. It aims to simplify the dual threading model of RTSJ by having a single thread type that can dynamically decide whether to run with or without garbage collector interference. This added flexibility could help avoid garbage collector priority inversions, enrich event handling programming, and enhance real-time distributed architectures in the Distributed RTSJ. The extension requires some underlying virtual machine support for activation/deactivation of reading barriers and checking local variables when changing memory environments. Performance evaluations show the absolute and relative penalties of using this extended threading model.
Hanborq Optimizations on Hadoop MapReduce (Hanborq Inc.)
A Hanborq-optimized Hadoop distribution with especially high MapReduce performance. It's the core part of HDH (Hanborq Distribution with Hadoop for Big Data Engineering).
1) The document presents a Synchronous Scheduling Service (SSS) that introduces time-triggered orientation for distributed real-time Java applications.
2) The SSS is based on the Flexible Time-Triggered protocol and is supported as a new service in the Distributed Real-Time Specification for Java.
3) It provides a more predictable network management which is useful for high-integrity applications.
The document discusses dense linear algebra solvers and algorithms. It provides an overview of existing software for dense linear algebra including LINPACK, EISPACK, LAPACK, ScaLAPACK, PLASMA, and MAGMA. It then discusses challenges with dense linear algebra on modern hardware including distributed memory, heterogeneity, and the high cost of communication. It introduces tile algorithms as an approach to address these challenges compared to traditional LAPACK algorithms.
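The tile idea can be shown with a blocked matrix multiply: operate on small square blocks that fit in fast memory, exposing independent tasks instead of the long column sweeps of classic LAPACK algorithms. A serial sketch of the blocking only (real PLASMA/MAGMA kernels schedule these tile tasks across cores and accelerators):

```python
def tiled_matmul(A, B, tile=2):
    """Multiply square matrices (lists of lists) tile by tile.

    Each (i0, j0, k0) iteration updates one tile of C using one tile
    each of A and B, so the working set per step is 3 * tile * tile
    elements regardless of the matrix size n.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # C[i0:i0+tile, j0:j0+tile] += A-tile @ B-tile
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        s = 0.0
                        for k in range(k0, min(k0 + tile, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```

The tile size would normally be tuned to the cache or GPU shared-memory size; the communication-cost argument in the document is exactly that each tile is loaded once and reused.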
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012 (Chris Richardson)
The database world is undergoing a major upheaval. NoSQL databases such as MongoDB and Cassandra are emerging as a compelling choice for many applications: they can simplify the persistence of complex data models and offer significantly better scalability and performance. But these databases have a very different and unfamiliar data model and APIs, as well as a limited transaction model. Moreover, the relational world is fighting back with so-called NewSQL databases such as VoltDB, which, by using a radically different architecture, offer high scalability and performance as well as the familiar relational model and ACID transactions. Sounds great, but unlike with a traditional relational database you can't use JDBC and must partition your data.
In this presentation you will learn about the popular NoSQL databases MongoDB and Cassandra, as well as VoltDB. We will compare and contrast each database's data model and Java API using NoSQL and NewSQL versions of a use case from the book POJOs in Action. We will learn about the benefits and drawbacks of using NoSQL and NewSQL databases.
The document summarizes the networks at CERN, including the IT-CS group which manages communication services, the extensive IP network connecting equipment and facilities, and the networks supporting the Large Hadron Collider experiments. It describes the large data flows and computing challenges of the LHC experiments and the worldwide computing grid (WLCG) established to support this. It focuses on the LHCOPN connecting CERN and the Tier1 centers and the new LHCONE being developed to better support the changing computing models of the experiments.
GeoServer is an open source geospatial data server. This document discusses GeoSolutions, which develops and supports GeoServer and other open source geospatial projects. It provides an overview of GeoServer's capabilities including support for OGC standards like WMS, WFS, WCS and WPS. New features in version 2.2 include virtual services, improved referencing, and security enhancements. Planned developments for version 2.3 include a database configuration backend, improved GWC clustering, and support for CSW 2.0 and WCS 2.0 with Earth Observation extensions.
GeoServer presentation @ Italian GFOSS day 2008 (GeoSolutions)
GeoServer is an open source server that allows users to share and edit geospatial data. It can handle raster and vector files from different data sources and supports common standards like WMS, WFS, and WCS. It has tools for styling maps and integrating geospatial data with other systems. GeoServer is highly configurable, supports many data formats, and continues to add new features through its active development community.
This document summarizes a research paper on stratified B-trees, a new data structure for versioned dictionaries that offers faster updates and optimal tradeoffs between space, query, and update costs compared to existing solutions like copy-on-write B-trees. Stratified B-trees arrange key-value pairs into levels of arrays to allow for sequential I/O during updates and range queries while maintaining density to avoid unnecessary duplication. The paper describes how stratified B-trees are structured, how operations like insertion and merging are performed, and analyzes their performance advantages over other versioned dictionaries.
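The level-merge step at the heart of such structures can be sketched with a k-way merge of sorted runs, where an entry from a newer run supersedes an older entry with the same key. This shows only the merge, not the versioning or density bounds from the paper, and the function name is illustrative:

```python
import heapq

def merge_sorted_runs(runs):
    """Merge sorted (key, value) runs; later runs win on duplicate keys.

    `runs` is a list of sorted arrays, ordered oldest to newest. Tagging
    each entry with its run index makes heapq.merge yield duplicates in
    age order, so the last write for a key survives. The merge reads and
    writes each array strictly in order, i.e. sequential I/O on disk.
    """
    merged = []
    for key, _age, value in heapq.merge(
        *[[(k, i, v) for k, v in run] for i, run in enumerate(runs)]
    ):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)   # newer entry overwrites older
        else:
            merged.append((key, value))
    return merged
```

The sequential-access property is what gives these structures their update throughput compared to the random writes of a copy-on-write B-tree.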
The document discusses grid computing at CERN for the Large Hadron Collider experiment. CERN operates a worldwide computing grid with tiered levels to handle the massive computing and storage needs. Tier 0 is at CERN for data acquisition and distribution. Tier 1 centers have large storage and do data analysis. Tier 2 centers participate in simulation and analysis. The LHC generates 40 million collision events per second that are filtered and recorded, resulting in 15 petabytes of data per year. The computing grid is necessary to process and store this huge volume of data across the distributed centers.
Memory efficient applications. Francesc Alted at Big Data Spain 2012 (Big Data Spain)
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/memory-efficient-applications/francesc-alted
The document discusses performance issues with deduplicating millions of POI records in a MySQL database. It proposes building a local cache, using multiple threads to separate database queries from deduplication computation, and exploring NoSQL options with spatial support to improve performance. Various tests were conducted to identify bottlenecks and assumptions. The key findings were that disk I/O was the main bottleneck, deduplication time increased with candidate size, and using multiple threads significantly improved throughput.
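The thread-separation idea, decoupling I/O-bound database reads from CPU-bound deduplication via a bounded queue, can be sketched as a small producer/consumer pipeline. The function names here are illustrative, not taken from the original system:

```python
import queue
import threading

def run_pipeline(records, fetch_candidates, dedupe):
    """Overlap 'database' reads with deduplication work.

    One thread plays the MySQL reader, pushing each record and its
    candidate set onto a bounded queue; the main thread consumes the
    queue and runs the deduplication computation. The bound provides
    backpressure if the reader outpaces the consumer.
    """
    q = queue.Queue(maxsize=64)
    SENTINEL = object()

    def reader():
        for rec in records:
            q.put((rec, fetch_candidates(rec)))   # simulated DB query
        q.put(SENTINEL)

    threading.Thread(target=reader, daemon=True).start()
    results = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        rec, candidates = item
        results.append(dedupe(rec, candidates))
    return results
```

With disk I/O as the bottleneck, even Python threads help here, since the reader spends its time blocked on I/O rather than holding the interpreter lock.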
The document discusses intra-cluster replication in Apache Kafka, including its architecture where partitions are replicated across brokers for high availability. Kafka uses a leader and in-sync replicas approach to strongly consistent replication while tolerating failures. Performance considerations in Kafka replication include latency and durability tradeoffs for producers and optimizing throughput for consumers.
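The leader/in-sync-replica rule reduces to one line: a message is committed once every replica in the in-sync set has it, so the committed offset (the high watermark) is the minimum log-end offset across the leader and its in-sync followers. A simplified model of the protocol, with illustrative names:

```python
def high_watermark(leader_log_end, follower_offsets, in_sync):
    """Committed offset, Kafka-style: the slowest in-sync replica
    bounds what consumers may read. Followers outside the in-sync
    set do not hold back the high watermark."""
    offsets = [leader_log_end] + [follower_offsets[f] for f in in_sync]
    return min(offsets)
```

This captures the latency/durability tradeoff the document mentions: shrinking the in-sync set advances the watermark sooner, at the cost of fewer guaranteed copies.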
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ... (Chester Chen)
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network, etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two to three orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLLib, Powergraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks. For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the roofline limits for sparse Allreduce, and empirically holds the record for distributed Pagerank. Beyond rooflining, we believe there are great opportunities from deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS which was developed for marginal parameter estimation. We show that it has high parallelism and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
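The roofline model itself is a single formula: attainable performance is the lesser of peak compute and memory bandwidth times the kernel's arithmetic intensity (flops per byte moved). A minimal sketch with illustrative names and numbers:

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Roofline bound: a kernel is either compute-bound (flat roof)
    or bandwidth-bound (the sloped part of the roofline)."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)
```

A kernel doing 4 flops per byte on a machine with 100 GB/s of bandwidth is bandwidth-bound at 400 Gflops no matter how fast the ALUs are; "rooflined design" means reshaping kernels until they sit near whichever roof applies.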
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds an INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and prototyped production systems for Overstock.com, Yahoo, Ebay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
This document discusses improving Spark performance on many-core machines by implementing an in-memory shuffle. It finds that Spark's disk-based shuffle is inefficient on such hardware due to serialization, I/O contention, and garbage collection overhead. An in-memory shuffle avoids these issues by copying data directly between memory pages. This results in a 31% median performance improvement on TPC-DS queries compared to the default Spark shuffle. However, more work is needed to address other performance bottlenecks and extend the in-memory shuffle to multi-node clusters.
Optimizing MongoDB: Lessons Learned at Localytics (andrew311)
Tips, tricks, and gotchas learned at Localytics for optimizing MongoDB installs. Includes information about document design, indexes, fragmentation, migration, AWS EC2/EBS, and more.
Lens: Data exploration with Dask and Jupyter widgets (Víctor Zabalza)
Lens is an open source Python library for automated data exploration of large datasets using Dask. It computes summary statistics and relationships between columns in a dataset. The results are serialized to JSON for interactive exploration through Jupyter widgets or a web UI. Dask allows the computations to run in parallel across a cluster for scalability. Lens integrates with the SherlockML platform to analyze all datasets uploaded.
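The summarize-then-serialize flow can be shown in miniature. This toy version uses plain Python rather than Dask (so no parallelism) and invented column statistics; it is not the Lens API, only the shape of the idea:

```python
import json
import statistics

def summarise(columns):
    """Per-column summary: count, mean, min and max for numeric
    columns, serialised to JSON so a widget or web UI can render the
    results without touching the raw data again."""
    report = {}
    for name, values in columns.items():
        report[name] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "min": min(values),
            "max": max(values),
        }
    return json.dumps(report)
```

Pushing the heavy computation into a one-off batch pass and exploring the small JSON result interactively is what makes the approach workable on datasets that do not fit in a notebook kernel.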
Performing Large Scale Repeatable Software Engineering Studies (Georgios Gousios)
The document discusses performing large-scale software engineering studies. It outlines how empirical research is currently done, including issues with small sample sizes, lack of experiment replication, and unavailable tools and data. The document then proposes a platform for software engineering research to address these issues. The platform would provide pre-processed data in standard formats, shared tools and results, and large-scale processing capabilities to enable more rigorous empirical studies.
Python has evolved through multiple versions with improvements to performance, features, and the standard library. The talk summarized key changes between Python versions 2.2 through 2.5, including major new features in 2.2 like iterators and generators, performance optimizations in 2.3 and 2.4, and new features in 2.5 like the "with" statement for resource allocation and context managers.
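The 2.5-era "with" statement pairs resource acquisition with guaranteed release, even when the body raises. A small sketch using the generator-based form (the `managed_resource` name is illustrative):

```python
from contextlib import contextmanager

@contextmanager
def managed_resource(log):
    """Acquire before the yield, release in the finally clause; the
    with-statement runs the release step no matter how the body exits."""
    log.append("acquire")
    try:
        yield "resource"
    finally:
        log.append("release")
```

Usage: `with managed_resource(log) as r: ...` releases on normal exit and on exceptions alike, which is exactly the cleanup guarantee the statement was added for.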
Ryosuke Iwanga gave a presentation about fighting big data at Mobage DBA. He discussed how Mobage grew from 600 million page views per day in 2006-2008 to over 2 billion page views per day with the introduction of social games in 2009-2010. He defined big data as having over 500 database servers, storing over 100TB of data, and handling over 1 million queries per second at peak times. He described techniques used at Mobage to scale out databases including replication, sharding, partitioning, and optimizing queries. He also discussed strategies for ensuring high availability, performing backups, and purging old data.
We created a web application for a well-known US newspaper, to create a maps-like zooming application on top of the 60,000 newspapers since 1850 and using Solr over the 28,000,000 articles to create an interactive heatmap over it. The out-of-the-box faceting solution was optimized using domain knowledge by order-of-magnitude which allowed us to create a great visual way of exploring trends in historical newspapers.
This document summarizes a presentation about Netflix's big data platform and Spark. The key points are:
1. Netflix uses Apache Spark on YARN and Mesos clusters to process batch and streaming data from sources like Cassandra and Kafka.
2. Netflix has contributed improvements to Spark's dynamic resource allocation, predicate pushdown, and support for S3 filesystems.
3. A use case showed Spark outperforming Pig for an iterative job that duplicated and aggregated data in multiple steps.
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
This document provides an overview and agenda for a lecture on graph processing using MapReduce. It discusses representing graphs as adjacency matrices or lists, and gives examples of single source shortest path and PageRank algorithms. Graph processing in MapReduce typically involves computations at each node and propagating those computations across the graph. Key challenges include representing graph structure suitably for MapReduce and traversing the graph in a distributed manner through multiple iterations.
The document provides an overview of distributed computing using Apache Hadoop. It discusses how Hadoop uses the MapReduce algorithm to parallelize tasks across large clusters of commodity hardware. Specifically, it breaks down jobs into map and reduce phases to distribute processing of large amounts of data. The document also notes that Hadoop is an open source framework used by many large companies to solve problems involving petabytes of data through batch processing in a fault tolerant manner.
Behind the Scenes at LiveJournal: Scaling StorytimeSergeyChernyshev
Brad talks about clustering setups using MySQL and DRDB and their Open Source software most of which he wrote initially and continues to develop.
A lot of these techniques and/or software is used by many other companies as well - among them Flickr/Yahoo! and Facebook.
Ruby 4.0 To Infinity and Beyond at Ruby Conference Kenya 2017 by Bozhidar BatsovMichael Kimathi
The document discusses plans and ideas for Ruby 4.0. Some key points include:
- Ruby 4.0, codenamed "Buzz", aims to be 4 times faster than Ruby 3.0 through optimizations and removing legacy features.
- New features may include immutable data structures like vectors, hashes and sets, as well as static typing and runtime contracts inspired by RDL.
- Improved concurrency through CSP-style APIs inspired by concurrent-ruby for better parallel programming.
- Simplicity is a core design principle - reducing redundancies and dropping unused or obscure features like BEGIN/END blocks, flip-flops, character literals and more.
- The standard library could be
This document discusses using MongoDB as a message queue. It summarizes the author's prior use of RabbitMQ, how MongoDB is now used at About.me for asynchronous and periodic tasks, and the benefits MongoDB provides as a message queue including sharding and durability. It also compares features of MongoDB queues to AMQP queues and provides code samples and performance benchmarks for using MongoDB as a message queue with Celery and Python.
The document discusses Spark operations like map, filter, reduceByKey, and their execution across partitions. It provides examples of transforming RDDs with word count and joining datasets. Machine learning algorithms like linear regression are also covered, including creating labeled point datasets, training models, and evaluating predictions. Logs and errors from running Spark tests in Python are displayed.
Cache is King ( Or How To Stop Worrying And Start Caching in Java) at Chicago...srisatish ambati
This document discusses strategies for optimizing cache performance in Java applications. It begins by providing examples of different caching technologies like Coherence, Gemfire, Ehcache, Cassandra and Memcached. It then discusses key metrics for measuring cache performance like insert, read and update latencies. The document outlines concepts like data locality, hit ratios and expiration policies that impact cache performance. It also demonstrates visualizing cache usage and heatmaps. Finally, it discusses techniques for optimizing the Java virtual machine for big data workloads, including reducing object overhead, using non-blocking collections to avoid locks, tuning garbage collection and avoiding memory leaks.
1. Mobile Terrestrial LiDAR Datasets in a Spatial Database Framework
Dr. Conor Mc Elhinney
Postdoctoral Researcher
Mobile Mapping Group
7th MMT, 16th June 2011