This document discusses Inmobi's analytics platform called Grill, which provides a unified analytics experience. Grill supports multiple execution engines and storage systems for Hive queries on data cubes. It rewrites queries to the most efficient execution engine and stores query histories. Grill provides a pluggable architecture and analytics capabilities on Inmobi's large Hadoop data warehouse.
Countdown to Zero - Counter Use Cases in Aerospike – Ronen Botzer
This is my talk from the third Israeli Aerospike User Group meetup. It covers modeling counters, handling hot keys on counters, and implementing accurate counters with Aerospike's strong consistency mode.
This document summarizes how switching from Hadoop to Spark for data science applications improved performance, reliability, and reduced costs at Salesforce. Some key issues addressed were handling large datasets across many S3 prefixes, efficiently computing segment overlap on skewed user data, and performing joins on highly skewed datasets. These changes resulted in applications that were 100x faster, used 10x less data, had fewer failures, and reduced infrastructure costs.
GeoMesa on Apache Spark SQL with Anthony Fox – Databricks
This document discusses location intelligence and GeoMesa. It begins with an introduction to location intelligence and GeoMesa. It then covers spatial data types, spatial SQL, and optimizing spatial SQL queries by extending Spark's Catalyst optimizer. Examples are provided to demonstrate calculating density of activity in San Francisco and generating a speed profile of a metro area using location data. Spatial analysis techniques like spatial joins, buffers, and geohashing are explored to extract insights from spatial data at scale.
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by... – Spark Summit
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
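The bitmap-index idea these systems share can be shown in a few lines. The sketch below is a toy illustration, not Roaring itself: Python's arbitrary-precision integers stand in for compressed containers, each distinct column value maps to a bitset of row ids, and a conjunctive predicate becomes a single bitwise AND.

```python
def build_index(rows, column):
    """Map each distinct value of `column` to a bitset of matching row ids."""
    index = {}
    for row_id, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << row_id)
    return index

def matching_rows(bitset):
    """Decode a bitset back into a sorted list of row ids."""
    rows, row_id = [], 0
    while bitset:
        if bitset & 1:
            rows.append(row_id)
        bitset >>= 1
        row_id += 1
    return rows

rows = [{"city": "NYC", "tier": "gold"},
        {"city": "SF",  "tier": "gold"},
        {"city": "NYC", "tier": "free"}]
by_city = build_index(rows, "city")
by_tier = build_index(rows, "tier")
# A conjunctive predicate (city = NYC AND tier = gold) is one bitwise AND:
hits = matching_rows(by_city["NYC"] & by_tier["gold"])   # [0]
```

Real compressed bitmap libraries add run-length and container compression on top of exactly this access pattern, which is why the bitwise operations stay cheap even over millions of rows.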
Intermediate Deep Learning – AlexNet and VGGNet (Basics of DCNN: AlexNet and VGGNet) – Hansol Kang
The document summarizes the basics of Deep Convolutional Neural Networks (DCNNs) including AlexNet and VGGNet. It discusses how AlexNet introduced improvements like ReLU activation and dropout to address overfitting issues. It then focuses on the VGGNet, noting that it achieved good performance through increasing depth using small 3x3 filters and adding convolutional layers. The document shares details of VGGNet configurations ranging from 11 to 19 weight layers and their performance on image classification tasks.
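VGGNet's depth-over-width argument rests on simple receptive-field arithmetic: n stacked 3x3 convolutions (stride 1) cover the same window as one larger filter, with fewer weights. A small sketch of that arithmetic (the 64-channel figure is only an illustrative assumption):

```python
def stacked_receptive_field(k, n):
    """Receptive field of n stacked k x k convolutions with stride 1."""
    return n * (k - 1) + 1

def conv_params(k, channels):
    """Weight count of one k x k convolution, `channels` in and out, no bias."""
    return k * k * channels * channels

# Two stacked 3x3 layers see a 5x5 window, three see 7x7:
assert stacked_receptive_field(3, 2) == 5
assert stacked_receptive_field(3, 3) == 7
# ...with far fewer weights than one large filter (64 channels assumed):
assert 3 * conv_params(3, 64) < conv_params(7, 64)
```

The stacked layers also interpose extra non-linearities, which is the second half of VGGNet's argument for depth.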
Enhancing Spark SQL Optimizer with Reliable Statistics – Jen Aman
This document discusses enhancements to the Spark SQL optimizer through improved statistics collection and cost-based optimization rules. It describes collecting table and column statistics from Hive metastore and developing 1D and 2D histograms. New rules estimate operator costs based on output rows and size. Join order, filter statistics, and handling unique columns are discussed. Future work includes faster histogram collection, expression statistics, and continuous feedback optimization.
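A minimal sketch of how a 1D equi-width histogram can feed a filter-selectivity estimate; this is a toy illustration of the idea, not Spark's or Hive's actual implementation:

```python
class EquiWidthHistogram:
    """Toy 1D equi-width histogram for filter-selectivity estimation."""

    def __init__(self, values, buckets=4):
        self.lo, self.hi = min(values), max(values)
        self.width = (self.hi - self.lo) / buckets or 1
        self.counts = [0] * buckets
        for v in values:
            i = min(int((v - self.lo) / self.width), buckets - 1)
            self.counts[i] += 1
        self.total = len(values)

    def selectivity_less_than(self, x):
        """Estimated fraction of rows with value < x (uniform within buckets)."""
        if x <= self.lo:
            return 0.0
        if x >= self.hi:
            return 1.0
        full = int((x - self.lo) / self.width)
        partial = (x - (self.lo + full * self.width)) / self.width
        est = sum(self.counts[:full]) + partial * self.counts[full]
        return est / self.total

h = EquiWidthHistogram(list(range(100)))
est = h.selectivity_less_than(50)   # close to 0.5
```

A cost-based optimizer multiplies estimates like this across predicates to rank join orders and pick physical plans.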
The document discusses two Spark algorithms: outlier detection on categorical data and KNN join. It describes how the algorithms work, including mapping attributes to scores for outlier detection and using z-order curves to map points to a single dimension for KNN joins. It also provides performance results and best practices for implementing the algorithms in Spark and discusses applications in graph algorithms.
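The z-order trick interleaves the bits of each coordinate so that nearby points tend to receive nearby one-dimensional keys. A minimal sketch (16-bit non-negative coordinates assumed):

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y into a single Morton (z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key

# Sorting by key clusters spatially close points, so a KNN join can scan
# a window of the sorted list instead of comparing all pairs:
keys = sorted((z_order(x, y), (x, y)) for x, y in [(5, 9), (6, 9), (40, 2)])
```

In the example, the neighbors (5, 9) and (6, 9) land on adjacent keys while (40, 2) sorts far away, which is the property the KNN join exploits.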
This document discusses visualizing database performance data using R. It begins with introductions of the presenter and Pythian. It then outlines topics to be covered, including data preprocessing, visualization tools/techniques, effective vs ineffective visuals, and common mistakes. The bulk of the document demonstrates various R visualizations like boxplots, scatter plots, filtering, smoothing, and heatmaps to explore and tell stories with performance data. It emphasizes summarizing data in a way that provides insights and surprises the audience.
GAN Explained Simply (What is this? Gum? It's GAN.) – Hansol Kang
The document discusses generative adversarial networks (GANs). It begins with an introduction to GANs, describing their concept and training process. It then reviews a seminal GAN paper, discussing its mathematical formulation of GAN training as a minimax game and theoretical results showing global optimality can be achieved. The document concludes by outlining the configuration, implementation, and flowchart for a GAN experiment.
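The minimax game described above is the value function from the original GAN paper:

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator D maximizes V while the generator G minimizes it; the paper's theoretical result is that the global optimum is reached when the generator's distribution equals p_data, at which point D(x) = 1/2 everywhere.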
Deep Convolutional GANs - meaning of latent space – Hansol Kang
DCGAN not only applies convolutional networks to the GAN framework, but also finds meaning in the latent space.
A review of the DCGAN paper and a PyTorch-based implementation.
A review of issues raised in the VAE seminar.
My GitHub: https://github.com/messy-snail/GAN_PyTorch
[References]
https://github.com/znxlwm/pytorch-MNIST-CelebA-GAN-DCGAN
https://github.com/taeoh-kim/Pytorch_DCGAN
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
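One common demonstration of latent-space meaning is linear interpolation between two latent codes: decoding the intermediate vectors with a trained generator yields a smooth semantic transition. A generator-free sketch of the interpolation itself (plain Python, no trained model assumed):

```python
def lerp(z0, z1, t):
    """Linearly interpolate between two latent vectors."""
    return [a + t * (b - a) for a, b in zip(z0, z1)]

def interpolation_path(z0, z1, steps=5):
    """Latent codes stepping evenly from z0 to z1."""
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]

# Decoding these with a trained DCGAN generator shows a smooth semantic
# transition (e.g. one generated face gradually becoming another):
path = interpolation_path([0.0, 1.0], [1.0, 0.0], steps=3)
```

That the decoded path morphs smoothly, rather than jumping between unrelated images, is the evidence the DCGAN paper offers that the latent space has learned meaningful structure.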
PyTorch is an open-source machine learning library for Python. It is primarily developed by Facebook's AI research group. The document discusses setting up PyTorch, including installing necessary packages and configuring development environments. It also provides examples of core PyTorch concepts like tensors, common datasets, and constructing basic neural networks.
Representing and Querying Geospatial Information in the Semantic Web – Kostis Kyzirakos
The document discusses representing and querying geospatial information in the semantic web. It introduces stRDF, an extension of RDF that adds spatial literals and valid time to triples. It also introduces stSPARQL, an extension of SPARQL with functions for querying spatial data based on Open Geospatial Consortium standards. The document describes the Strabon system, which uses stRDF and supports both stSPARQL and the OGC standard GeoSPARQL for querying geospatial data stored in RDF graphs.
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs – NoSQLmatters
There are several challenges in the NoSQL world. Especially if you have very high availability requirements, you have to accept temporary inconsistencies, which you need to resolve explicitly. This is usually a tough job that requires implementing case-by-case business logic, or even bothering the users to decide on the correct state of your data. Wouldn't it be great if we could solve this conflict resolution and data reconciliation process in a generic way, at a purely technical level? That's exactly what CRDTs (Conflict-free Replicated Data Types) are about. CRDTs are data structures that are guaranteed to converge to a desired state while enabling extreme availability of the datastore. In this session you will learn what CRDTs are, how to design them, what you can do with them, and what their limitations and tradeoffs are – of course garnished with lots of tips and tricks. Get ready to push the availability of your datastore to the max!
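The simplest example of such a convergent data structure is the grow-only counter (G-Counter). In this minimal sketch, each replica increments only its own slot and a merge takes per-slot maxima, so merges commute, associate, and are idempotent, and replicas converge regardless of merge order:

```python
class GCounter:
    """Grow-only counter CRDT: one increment slot per replica;
    merge takes per-slot maxima, so any merge order converges."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.slots = {}

    def increment(self, n=1):
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.slots.values())

    def merge(self, other):
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
# Both replicas now report 5, with no coordination during the increments.
```

Counters that also support decrements (PN-Counters), sets, and registers are built from the same merge-function discipline.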
Valdestilhas et al. propose using the most frequent K characters (MFKC) as a string similarity measure and develop efficient filtering approaches. They define MFKC and derive a similarity function σ. They present three filters - a hash intersection filter, frequency filter, and most frequent character filter - to efficiently compute string pairs with σ above a threshold. Their experimental evaluation shows the filters improve runtime over naive approaches while maintaining high precision, recall, and F-measure.
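To make the idea concrete, here is a hypothetical sketch in the MFKC spirit; the overlap-and-normalize formula below is an illustrative assumption, not the paper's exact σ:

```python
from collections import Counter

def top_k_chars(s, k):
    """The k most frequent characters of s, with their counts."""
    return dict(Counter(s).most_common(k))

def mfkc_similarity(s1, s2, k=2):
    """Illustrative similarity: overlap of the two strings' k most
    frequent characters, normalized by total length (not the paper's σ)."""
    a, b = top_k_chars(s1, k), top_k_chars(s2, k)
    shared = sum(min(a[c], b[c]) for c in a if c in b)
    return 2 * shared / (len(s1) + len(s2))
```

Because only the top-k character profiles are compared, cheap filters (hash intersection, frequency bounds) can discard most candidate pairs before the full similarity is computed, which is where the paper's runtime gains come from.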
LSGAN - SIMPle (Simple Idea Meaningful Performance Level up) – Hansol Kang
LSGAN replaces the conventional GAN loss with an MSE loss, producing more realistic data.
A review of the LSGAN paper and a PyTorch-based implementation.
[References]
Mao, Xudong, et al. "Least squares generative adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.
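The loss swap at the heart of LSGAN can be sketched in plain Python. The 0/1 target labels below are one common choice for the paper's a, b, c coefficients, assumed here purely for illustration:

```python
def mse(preds, target):
    """Mean squared error of predictions against a constant target label."""
    return sum((p - target) ** 2 for p in preds) / len(preds)

def lsgan_d_loss(d_real, d_fake):
    """Discriminator: pull real outputs toward 1 and fake outputs toward 0."""
    return 0.5 * mse(d_real, 1.0) + 0.5 * mse(d_fake, 0.0)

def lsgan_g_loss(d_fake):
    """Generator: pull the discriminator's fake outputs toward 1."""
    return 0.5 * mse(d_fake, 1.0)
```

Unlike the log loss, the squared error keeps penalizing confidently classified samples that sit far from the decision boundary, which is the paper's argument for more stable gradients and more realistic samples.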
This document provides a summary of MapReduce algorithms. It begins with background on the author's experience blogging about MapReduce algorithms in academic papers. It then provides an overview of MapReduce concepts including the mapper and reducer functions. Several examples of recently published MapReduce algorithms are described for tasks like machine learning, finance, and software engineering. One algorithm is examined in depth for building a low-latency key-value store. Finally, recommendations are provided for designing MapReduce algorithms including patterns, performance, and cost/maintainability considerations. An appendix lists additional MapReduce algorithms from academic papers in areas such as AI, biology, machine learning, and mathematics.
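The mapper/reducer contract described above fits in a few lines. A single-process word-count sketch, in which an in-memory sort stands in for the distributed shuffle:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce phase: sum every count emitted for one word."""
    return word, sum(counts)

def map_reduce(lines):
    """Single-process simulation; the sort stands in for the shuffle."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(word, (c for _, c in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

counts = map_reduce(["to be or", "not to be"])
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Every algorithm in the survey follows this shape; the design work lies in choosing keys so the shuffle groups exactly the records each reduce step needs.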
Move your data (Hans Rosling style) with googleVis + 1 line of R code – Jeffrey Breen
This document describes a lightning talk presented at the Greater Boston useR Group in July 2011 about using the googleVis package in R to create motion charts with only one line of code. It discusses Hans Rosling's use of animated charts, how Google incorporated this into their visualization API, and how the googleVis package allows users to leverage this in R. The talk includes examples of creating motion charts in R with googleVis using sample airline data.
The document discusses using the raster package in R to work with geographical grid data. It covers downloading and loading the raster package, creating raster objects and adding random values, reading in real climate data files, performing operations like cropping and aggregation, and sources for global climate data like WorldClim.
Building Scalable Semantic Geospatial RDF Stores – Kostis Kyzirakos
This document outlines a model called stRDF for representing geospatial and temporal data in RDF, along with a query language called stSPARQL. It also describes Strabon, a scalable geospatial RDF store for storing and querying stRDF data. Strabon extends the Semantic Web toolkit Sesame and uses PostGIS for geospatial indexing and functions. The document evaluates Strabon's performance against Sesame on geospatial linked data and synthetic datasets. Finally, it discusses other extensions like the RDFi framework for representing data with incomplete information.
Cascading talk in Etsy (http://www.meetup.com/cascading/events/169390262/) – Jyotirmoy Sundi
AdMobius is a mobile audience management platform that uses Cascading for complex data aggregation and processing in its tech stack. Cascading allows AdMobius to easily write custom aggregators and workflows for tasks like device graph building, scoring, and profiling audiences at scale across billions of mobile devices. Some key benefits of using Cascading include its support for custom joins, taps for various data sources, and best practices like checkpointing and compression to optimize performance.
Topic Set Size Design with Variance Estimates from Two-Way ANOVA – Tetsuya Sakai
This document discusses methods for determining an appropriate topic set size for new IR test collections. It presents two approaches: (1) ensuring high statistical power to detect differences between systems above a threshold, and (2) ensuring confidence intervals for pairwise system differences are below a threshold. Both require variance estimates, which can be obtained via three methods - a new two-way ANOVA method is considered the safest. Results show topic set sizes vary by evaluation measure and collection, but pooling depth can reduce assessment costs while maintaining statistical validity.
This document discusses data cubes in Apache Hive. It provides background on Hive and why it is used at Inmobi for analytics. It describes how data cubes are modeled and stored in Hive, including facts, dimensions, and storage. Examples of cube queries in Hive Query Language (HQL) are shown. The document also introduces Grill, Inmobi's analytics platform that utilizes Hive and provides additional capabilities like query scheduling and multiple execution engines.
Apache Lens is a unified analytics platform that enables multi-dimensional queries over datasets stored in multiple data warehouses like Hadoop and columnar databases. It provides a single metadata layer and OLAP cube abstraction to allow for data discovery and unified access across data sources. Lens uses a distributed architecture and can push queries to where data resides for efficient processing.
As the number of metrics, the software that produces and processes them, and the people involved with them all continue to grow, we need better ways to organize metrics and make them self-describing, in a consistent way. Leveraging this, we can automatically build graphs and dashboards from a query that represents an information need, even in complicated cases, and build richer visualizations, alerting, and fault detection. This talk will introduce the concepts and related tools, demonstrate the possibilities using the Graph-Explorer interface, and lay the groundwork for future work.
Digital analytics with R - Sydney Users of R Forum - May 2015 – Johann de Boer
This document discusses using the ganalytics R package to access and analyze Google Analytics data through R. It provides an overview of Google Analytics and its APIs, demonstrates how to build queries with ganalytics, extract and summarize data in R. It also discusses enhancing ganalytics by improving documentation, testing, adding features, and internationalization. The document encourages participation in open source development of the package.
- The document discusses Inmobi's analytics data warehouse which contains 170 TB of data and the challenges of querying across different data stores and execution engines.
- It introduces Apache Hive and the OLAP cube model for representing multi-dimensional data, and provides examples of queries on cubes.
- Grill is presented as Inmobi's solution to unify querying across Hive, Impala, and other engines through a single interface and metadata catalog. A demo of Grill's capabilities is included in the agenda.
This document discusses Inmobi's analytics platform and the challenges it aimed to address with Grill. It provides an overview of Hive and how Grill leverages Hive to provide a unified query layer across different execution engines and data stores. Key points include how Grill allows OLAP queries on cubes, rewrites queries for different storages, and selects the most efficient execution engine to run the query. The demo then shows Grill in action.
Don't optimize my queries, organize my data! – Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ... – Spark Summit
In this presentation, we are going to talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have successfully used multiple micro-batch Spark Streaming pipelines to update and process information like product availability, store pickup, etc., along with updating our product catalog information in our search index, at up to 10,000 Kafka events per second in near real time. Earlier, all product catalog changes in the index had a 24-hour delay; using Spark Streaming we have made it possible to see these changes in near real time. This addition has provided a great boost to the business by giving end customers instant access to features like product availability and store pickup.
Second, we have built a scalable anomaly detection framework purely using Spark DataFrames that is used by our data pipelines to detect anomalies in search data. Anomaly detection is an important problem not only in the search domain but also in many others, such as performance monitoring and fraud detection. In the process, we realized that Spark DataFrames not only process information faster but are also more flexible to work with. One can write Hive-like queries, Pig-like code, UDFs, UDAFs, Python-like code, etc., all in the same place, and build DataFrame templates that can be used and reused by multiple teams effectively. We believe that, implemented correctly, Spark DataFrames can potentially replace Hive/Pig in the big data space and have the potential to become a unified data language.
We conclude that Spark Streaming and DataFrames are key to processing extremely large streams of data in real time with ease.
1. Scalding is a library that provides a concise domain-specific language (DSL) for writing MapReduce jobs in Scala. It allows defining source and sink connectors, as well as data transformation operations like map, filter, groupBy, and join in a more readable way than raw MapReduce APIs.
2. Some use cases for Scalding include splitting or reusing data streams, handling exotic data sources like JDBC or HBase, performing joins, distributed caching, and building connected user profiles by bridging data from different sources.
3. For connecting user profiles, Scalding can be used to model the data as a graph with vertices for user interests and edges for bridging rules.
Michael will present an overview of Elastic's machine learning capabilities.
As we know, data science work can be messy, fractured, and challenging as data volumes increase. This session will explore how the Elastic stack can offer a single destination for data ingestion and exploration, time series modeling, and communication of results through data visualizations by focusing on a few sample data sources.
We will also explore new functionality offered by Elastic machine learning, in particular an integration with our APM solution.
Trained as a mathematician, Michael Hirsch started his career with no development experience. His first task: "model the world in a relational database." Over the last 7 years Michael has established himself as a data scientist, with a focus on building end-to-end systems. In his career, he has built machine learning powered platforms for clients including Nike, Samsung, and Marvel, and approaches his work with the idea that machine learning is only as useful as the interfaces that users interact with.
Currently, Michael is a Product Engineer for Machine Learning at Elastic. He focuses on tailoring Elastic's ML offering to customer use cases, as well as integrating machine learning capabilities across the entire Elastic Stack.
This document outlines a data science workflow to predict cab booking cancellations. It describes collecting booking data containing 18 features, engineering new features like distance and user segmentation, and using random forests and logistic regression for modelling. Random forests achieved 74.3% accuracy on the test set after tuning the maximum depth. Distance, creation month, and online booking were the most important features. The results could help reduce fleet sizes in high cancellation months and notify users likely to cancel.
Grill is a unified analytics platform developed by Inmobi to address problems with disparate data storage and querying across their Hadoop and SQL data warehouses. Grill provides a single interface and catalog to allow both ad-hoc and canned queries to run interactively or in batch mode over billions of records. It uses a pluggable execution engine architecture to optimize query costs and allows queries to span multiple storage systems like Hadoop, HBase and SQL data stores.
This document outlines the agenda for a two-day workshop on learning R and analytics. Day 1 will introduce R and cover data input, quality, and exploration. Day 2 will focus on data manipulation, visualization, regression models, and advanced topics. Sessions include lectures and demos in R. The goal is to help attendees learn R in 12 hours and gain an introduction to analytics skills for career opportunities.
Sorry - How Bieber broke Google Cloud at Spotify – Neville Li
Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala big data and our journey to migrate our entire data infrastructure to Google Cloud and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.
Structured streaming provides a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It allows processing live data streams using continuous queries that look identical to batch queries. The presentation discusses Spark components including RDDs, DataFrames and Datasets. It then covers limitations of the traditional Spark Streaming model and how structured streaming addresses them by using incremental execution plans and exactly-once semantics. An example of a word count application and demo is presented to illustrate structured streaming concepts.
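The word-count example mentioned above can be mimicked in plain Python to illustrate the incremental model: state (the running counts) is carried across micro-batches, and each batch updates rather than recomputes the result table. This is an illustrative sketch, not Spark code:

```python
from collections import Counter

def incremental_word_count(micro_batches):
    """Process a stream of micro-batches (each a list of input lines),
    keeping running counts as state; each element of the returned list
    is the full result table after that batch (complete output mode)."""
    state = Counter()
    results = []
    for batch in micro_batches:
        for line in batch:
            state.update(line.split())
        results.append(dict(state))
    return results
```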
The document describes the KDM tool, which automates Cassandra data modeling tasks. It streamlines the data modeling methodology by guiding users and automating conceptual to logical mapping, physical optimizations, and CQL generation. The KDM tool simplifies the complex data modeling process, eliminates human errors, and helps users build, verify, and learn data modeling. Future work on the tool includes support for materialized views, user defined types, application workflow design, and additional diagram types.
This document provides an introduction to Mahout, an Apache project for scalable machine learning. It discusses Mahout's math library capabilities including matrices, vectors, functions and sampling. It also covers Mahout's clustering, classification and recommendation algorithms. The document then focuses on recommendation systems, describing basic collaborative filtering approaches and how to address their limitations (such as cold starts) through multi-modal, cross-recommendations that incorporate multiple data types and behavior streams. It provides an example of how video recommendations could be generated based on user queries.
Schema Design by Chad Tindel, Solution Architect, 10gen – MongoDB
MongoDB’s basic unit of storage is a document. Documents can represent rich, schema-free data structures, meaning that we have several viable alternatives to the normalized, relational model. In this talk, we’ll discuss the tradeoff of various data modeling strategies in MongoDB using a library as a sample application. You will learn how to work with documents, evolve your schema, and common schema design patterns.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
How to Get CNIC Information System with Paksim Ga.pptx – danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf – Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share foundational concepts to build on.
Climate Impact of Software Testing at Nordic Testing Days – Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Securing your Kubernetes cluster_ a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
A tale of scale & speed: How the US Navy is enabling software delivery from l... – sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
UiPath Test Automation using UiPath Test Suite series, part 5 – DianaGray10
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
TrustArc Webinar - 2024 Global Privacy Survey – TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf – Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
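The core idea behind the vector search described above can be sketched in a few lines: rank documents by cosine similarity between embeddings. This is a toy brute-force version for illustration only; Atlas does this at scale with approximate nearest-neighbor indexes:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def vector_search(query, docs, k=3):
    """docs: mapping of doc id -> embedding; returns the k ids most
    similar to the query embedding (exhaustive scan, for illustration)."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]
```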
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI – Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster and ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
5. About InMobi
Global mobile technology company enabling
- Developers & Publishers to monetize
- Advertisers to engage and acquire users
@ Scale
6. Digital advertising – Intro
Courtesy: http://www.liesdamnedlies.com/
Publisher: owns & sells real estate on digital inventory; has reach to users
Advertiser: wants to target users; brings money
Market place: connects publishers and advertisers
Consumer
8. Use cases – Users
• Advertisers and publishers
• Regional officers
• Account managers
• Executive team
• Business analysts
• Product analysts
• Data scientists
• Engineering systems
• Developers
9. Use cases – Reporting
• Understand trends
• Debugging / Postmortem of issues (troubleshooting)
• Sizing & Estimation (Ex: inventory, reach)
• Summary of Product lines, Geographies, Network (Ex: Rev by Geo)
10. Categorize the use cases
• Batch queries
• Adhoc queries
• Interactive queries
• Scheduled reports
• Infer insights through ML algorithms
11. Analytics systems at Inmobi
• Adhoc querying system
• Dashboard system
• Customer facing system
12. Problems
• Disparate user experience
• Disparate data storage systems causing inability to scale
• Schema management
• Not leveraging the surrounding community
14. Apache Hive to the rescue

What does Hive provide:
• Associates structure to data
• Provides metastore and catalog service – HCatalog
• Provides a pluggable storage interface
• Accepts SQL-like batch queries
• HQL is a language widely adopted by systems like Shark, Impala
• Has a strong Apache community

What is missing in Hive:
• Data warehouse features like facts, dimensions
• Logical table associated with multiple physical storages
• Interactive queries
• Scheduling queries
• UI
18. Data Model
A cube is backed by fact tables and dimension tables; each fact table maps to physical fact tables and each dimension table to physical dimension tables, on one or more storages.
19. Data Model - Cube
A cube has measures and dimensions.

Dimension types:
• Simple Dimension
• Referenced Dimension
• Hierarchical Dimension
• Expression Dimension
• Timed dimension

Measure types:
• Column Measure
• Expression Measure

Note: Some of the concepts are borrowed from
http://community.pentaho.com/projects/mondrian/
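As an illustration only (the class and field names below are assumptions, not Grill's actual API), the cube metadata described on this slide might be modeled as:

```python
from dataclasses import dataclass, field

@dataclass
class Measure:
    name: str
    kind: str = "column"          # "column" or "expression" measure
    default_aggregate: str = "sum"

@dataclass
class Dimension:
    name: str
    kind: str = "simple"          # simple/referenced/hierarchical/expression/timed

@dataclass
class Cube:
    name: str
    measures: dict = field(default_factory=dict)     # name -> Measure
    dimensions: dict = field(default_factory=dict)   # name -> Dimension

    def column_kind(self, col):
        """Classify a queried column as a measure or a dimension."""
        if col in self.measures:
            return "measure"
        if col in self.dimensions:
            return "dimension"
        raise KeyError(f"{col} is not part of cube {self.name}")
```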
20. Data Model – Storage
Storage: name, end point, properties
Ex: ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
21. Data Model – Fact Table
Fact table:
• Columns
• Cube that it belongs to
• Storages on which it is present and the associated update periods
22. Data Model – Dimension table
Dimension table:
• Columns
• Dimension references
• Storages on which it is present and the associated snapshot dump period, if any
23. Data Model – Storage tables and partitions
Storage table (a fact storage table belongs to a fact table; a dimension storage table belongs to a dimension table):
• Belongs to a fact/dimension table
• Associated storage descriptor
• Partitioned by columns
• Naming convention: storage name followed by the fact/dimension name
• A partition can override its storage descriptor
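The naming convention above is visible in the example queries later in the deck (c2_testfact, c1_citytable). A one-line sketch, assuming lowercase names joined by an underscore:

```python
def storage_table_name(storage_name, table_name):
    # Convention from the slide: storage name followed by the
    # fact/dimension table name, e.g. ("c2", "testfact") -> "c2_testfact".
    return f"{storage_name}_{table_name}".lower()
```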
25. Queries on Data cubes

CUBE SELECT [DISTINCT]
select_expr, select_expr, ...
FROM cube_table_reference
WHERE [where_condition AND]
TIME_RANGE_IN(colName, from, to)
[GROUP BY col_list]
[HAVING having_expr]
[ORDER BY colList]
[LIMIT number]

cube_table_reference:
cube_table_factor
| join_table

join_table:
cube_table_reference JOIN cube_table_factor [join_condition]
| cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition]

cube_table_factor:
cube_name [alias]
| ( cube_table_reference )

join_condition:
ON equality_expression ( AND equality_expression )*

equality_expression:
expression = expression

colOrder: ( ASC | DESC )
colList: colName colOrder? (',' colName colOrder?)*
26. Example query

cube select name, stateid from citytable limit 100

is rewritten, depending on the chosen storage, to:
• SELECT (citytable.name), (citytable.stateid) FROM c2_citytable citytable LIMIT 100
• SELECT (citytable.name), (citytable.stateid) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100
27. Example query

cube select citytable.name, msr2 from testcube where
timerange_in(dt, '2014-03-10-03', '2014-03-12-03')

is rewritten to:

SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER
JOIN c1_citytable citytable ON ((testcube.cityid) = (citytable.id)) WHERE
((testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR
(testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR
(testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR
(testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR
(testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR
(testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR
(testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR
(testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR
(testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR
(testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR
(testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR
(testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12-01') OR
(testcube.dt='2014-03-12-02')) AND (citytable.dt = 'latest')
GROUP BY (citytable.name)
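The interesting part of the rewrite above is how timerange_in(dt, from, to) is expanded: the range is covered with the coarsest partitions available, so the full day 2014-03-11 replaces 24 hourly predicates. A sketch of that expansion (a simplification assuming only daily and hourly update periods, not Grill's actual resolver):

```python
from datetime import datetime, timedelta

def resolve_time_partitions(start, end):
    """Cover the half-open range [start, end) with daily partitions
    wherever a whole day fits, and hourly partitions elsewhere."""
    parts, t = [], start
    while t < end:
        if t.hour == 0 and t + timedelta(days=1) <= end:
            parts.append(t.strftime("%Y-%m-%d"))     # daily partition
            t += timedelta(days=1)
        else:
            parts.append(t.strftime("%Y-%m-%d-%H"))  # hourly partition
            t += timedelta(hours=1)
    return parts
```

For the range in the example ('2014-03-10-03' to '2014-03-12-03'), this yields 21 hourly partitions on 2014-03-10, the daily partition 2014-03-11, and 3 hourly partitions on 2014-03-12, matching the predicates in the rewritten query.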
28. Where is it available

Available in Hive:
• Data warehouse features like facts, dimensions
• Logical table associated with multiple physical storages

Available in Grill:
• Pluggable execution engine for HQL
• Query history, caching
• Scheduling queries
30. Grill – Unified analytics platform at Inmobi
• Supports multiple execution engines
• Supports multiple storages
• Provides analytics on the system
• Provides query history
32. Pluggable execution engine
Implements an interface:
• execute
• explain
• executeAsynchronously
• fetchResults
Specifies all storages it can support
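In Python-flavored pseudocode (the method names come from the slide; signatures, return values, and everything else are assumptions), the driver contract might look like:

```python
from abc import ABC, abstractmethod

class ExecutionDriver(ABC):
    """Sketch of the pluggable execution-engine interface; each engine
    (e.g. Hive MR, an interactive engine, a SQL store) implements it."""

    @abstractmethod
    def execute(self, hql):
        """Run synchronously and return rows."""

    @abstractmethod
    def explain(self, hql):
        """Return a query plan (from which a cost can be derived)."""

    @abstractmethod
    def execute_asynchronously(self, hql):
        """Submit the query and return a handle."""

    @abstractmethod
    def fetch_results(self, handle):
        """Return rows for a previously submitted handle."""

    @abstractmethod
    def supported_storages(self):
        """Return the storages this engine can read."""
```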
33. Cube query with multiple execution engines
1. A Cube QL query comes in
2. Rewrite the query for each available execution engine's supported storages
3. Get the cost of the rewritten query from each execution engine
4. Pick the execution engine with the least cost and fire the query
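The steps above amount to a least-cost loop over the registered engines. A minimal sketch, assuming each driver exposes a supported_storages() method and an explain() that returns a numeric cost (both names are assumptions for illustration):

```python
def pick_engine(cube_query, drivers, rewrite):
    """rewrite(query, storages) -> HQL targeting only those storages.
    Returns the (driver, rewritten_query) pair with the least estimated cost."""
    best = None
    for driver in drivers:
        hql = rewrite(cube_query, driver.supported_storages())
        cost = driver.explain(hql)
        if best is None or cost < best[0]:
            best = (cost, driver, hql)
    _, driver, hql = best
    return driver, hql
```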
36. Data warehouse statistics
• Number of queries – 700 to 900 per day
• Number of dimension tables – 125
• Number of fact tables – 24
• Number of cubes – 15
• Size of the data
• Total size – 136 TB
• Dimension data – 400 MB compressed per hour
• Raw data – 1.2 TB per day
• Aggregated facts – 53 GB per day
InMobi is a global mobile technology company that allows app developers and other publishers to monetize their space on mobile, and allows advertisers to engage and acquire users. All this at scale: InMobi serves a few billion ad impressions per day.
InMobi provides a marketplace where it buys space on mobile from publishers and sells it to advertisers, who acquire users in the process.
InMobi has a 130 TB Hadoop warehouse and a 5 TB SQL warehouse. Let us see an example of a reporting page. This is the dashboard a publisher sees.
Users of analytics system
Adhoc querying system: adhoc and batch queries; scheduled queries; based on Hadoop MapReduce; provides UI and custom API; data is stored in HDFS.
Dashboard system: canned reports; interactive and adhoc queries; provides UI and custom API; data is stored in a columnar DWH.
Customer facing system: the face to the outside world (advertisers and publishers); interactive and adhoc queries; provides UI and custom API; data is stored in a relational DB, Postgres.
Conventional columnar database (RDBMS) systems lend themselves well to interactive SQL queries over reasonably small datasets, on the order of 10-100s of GB, while Hadoop-based warehouses operate well over large datasets on the order of TBs and PBs and scale fairly linearly. Though there have been some recent improvements in storage structures in Hadoop warehouses, such as ORC, queries over Hadoop still typically adopt a full-scan approach. Choosing between these different data stores based on cost of storage, concurrency, scalability and performance is fairly complex and not easy for most users.
Individually, all the systems we just saw work really well and provide good response times to user queries. But:
Disparate user experience because of multiple reporting systems; each system and its API involves a learning curve.
Disparate data storage systems cause an inability to scale; altering the schema involves different systems; data discovery suffers; data in other systems cannot be leveraged.
Not leveraging the surrounding community: cannot experiment with a new storage/execution engine out of the box.
Now let us see why Inmobi wants to use Apache Hive
Column Measure: name, type, default aggregate, format string, start date, end date
Expression Measure: associated expression
Simple Dimension: name, type, start date, end date
Referenced Dimension: referencing table and column
Hierarchical Dimension: hierarchy
Expression Dimension: associated expression
The grammar is a subset of HQL.
Resolve candidate dimension tables and their storage tables.
Resolve the candidate fact tables which can answer the query; pick the ones from the top of the pyramid.
Resolve fact storage tables for the queried time range.
Automatically resolve joins using the relationships between cubes and dimensions.
Automatically add aggregate functions to measures.
Add expressions to the group-by clause if they are projected, and project the group-by clause if it is not.
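The last two steps in the notes above (automatically adding aggregates and the group-by clause) can be sketched as follows, assuming each measure carries a default aggregate (an illustration, not Grill's rewriter):

```python
def add_aggregates_and_group_by(select_cols, measures):
    """measures: {measure_name: default_aggregate}. Measures are wrapped in
    their default aggregate; everything else is projected and grouped by,
    mirroring how 'cube select citytable.name, msr2 ...' becomes
    'SELECT citytable.name, sum(msr2) ... GROUP BY citytable.name'."""
    projection, group_by = [], []
    for col in select_cols:
        if col in measures:
            projection.append(f"{measures[col]}({col})")
        else:
            projection.append(col)
            group_by.append(col)
    return projection, group_by
```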