- The document discusses Inmobi's analytics data warehouse which contains 170 TB of data and the challenges of querying across different data stores and execution engines.
- It introduces Apache Hive and the OLAP cube model for representing multi-dimensional data, and provides examples of queries on cubes.
- Grill is presented as Inmobi's solution to unify querying across Hive, Impala, and other engines through a single interface and metadata catalog. A demo of Grill's capabilities is included in the agenda.
Apache Kylin is an open source distributed analytics engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets and subsecond query latency.
This talk covers best practices and patterns for designing an efficient cube in Kylin, including concepts such as mandatory dimensions, hierarchy dimensions, derived dimensions, incremental builds, and aggregation groups.
Kylin is an open source distributed analytics engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
Kylin Open Source Web Site: http://kylin.io
Apache Kylin 2.0: from classic OLAP to real-time data warehouse (Yang Li)
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
Cost-based Query Optimization in Apache Phoenix using Apache Calcite (Julian Hyde)
This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite.
Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics.
We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.
Why you care about relational algebra (even though you didn’t know it) (Julian Hyde)
A talk given by Julian Hyde at Enterprise Data World in Washington, DC on April 2nd, 2015.
With data in different systems, in different formats, and accessed via different tools, we need a lingua franca for data. Not all tools speak SQL, and data cannot be moved into a single convenient location.
Relational algebra underpins SQL and many other DB languages. It is also perfect for optimizing, caching and mediating.
Apache Calcite (formerly Optiq) is a framework for building and optimizing expressions in relational algebra. We show how to write queries, optimize queries using rewrite rules, and write adapters for back-end systems. We also show how to configure Calcite to materialize queries, so your interactive analytics are effectively running against a fast in-memory database.
Apache Kylin’s Performance Boost from Apache HBase (HBaseCon)
Hongbin Ma and Luke Han (Kyligence)
Apache Kylin is an open source distributed analytics engine that provides a SQL interface and multi-dimensional analysis on Hadoop supporting extremely large datasets. In the forthcoming Kylin release, we optimized query performance by exploring the potentials of parallel storage on top of HBase. This talk explains how that work was done.
What's new in Mondrian 4? Slides for a talk given by Julian Hyde to the Pentaho Bay Area User Group meetup in San Francisco on April 3rd, 2014.
Topics covered include attributes and attribute hierarchies, measure groups and aggregate tables, physical schema, and how to download and start using Mondrian 4 beta with Pentaho CE.
A general introduction to Apache Kylin, covering background, business needs and technical challenges, theory and architecture, features, and some technical detail, followed by performance and benchmarks and, finally, the ecosystem and roadmap.
For more detail, visit http://kylin.io or follow @ApacheKylin.
Cost-based query optimization in Apache Hive 0.14 (Julian Hyde)
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive introduces cost-based optimization for the first time, based on the Optiq framework. Optiq's lead developer Julian Hyde shows the improvements that CBO is bringing in Apache Hive 0.14.
For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive with the Stinger.next initiative.
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in... (Julian Hyde)
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive (Xu Jiang)
Kylin is an open source distributed analytics engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
If you want to do multi-dimensional analysis on large data sets (billion+ rows) with low query latency (sub-second), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g. Tableau).
Cost-based query optimization in Apache Hive (Julian Hyde)
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive 0.13 introduces cost-based optimization for the first time, based on the Optiq framework.
Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive 0.13. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.
This presentation contains the following slides:
Introduction To OLAP
Data Warehousing Architecture
The OLAP Cube
OLTP Vs. OLAP
Types Of OLAP
ROLAP vs. MOLAP
Benefits Of OLAP
Introduction - Apache Kylin
Kylin - Architecture
Kylin - Advantages and Limitations
Introduction - Druid
Druid - Architecture
Druid vs Apache Kylin
References
For any queries, contact us: argonauts007@gmail.com
The slides for the first ever SnappyData webinar. Covers SnappyData core concepts, programming models, benchmarks and more.
SnappyData is open sourced here: https://github.com/SnappyDataInc/snappydata
We also have a deep technical paper here: http://www.snappydata.io/snappy-industrial
We can be easily contacted on Slack, Gitter and more: http://www.snappydata.io/about#contactus
Emerging technologies/frameworks in Big Data (Rahul Jain)
A short overview presentation on emerging technologies and frameworks in Big Data, covering Apache Parquet, Apache Flink, and Apache Drill, with the basic concepts of columnar storage and Dremel.
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks (Grega Kespret)
Celtra provides a platform for streamlined ad creation and campaign management used by customers including Porsche, Taco Bell, and Fox to create, track, and analyze their digital display advertising. Celtra’s platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Celtra’s Grega Kešpret leads a technical dive into Celtra’s data-pipeline challenges and explains how it solved them by combining Snowflake’s cloud data warehouse with Spark to get the best of both.
Topics include:
- Why Celtra changed its pipeline, materializing session representations to eliminate the need to rerun its pipeline
- How and why it decided to use Snowflake rather than an alternative data warehouse or a home-grown custom solution
- How Snowflake complemented the existing Spark environment with the ability to store and analyze deeply nested data with full consistency
- How Snowflake + Spark enables production and ad hoc analytics on a single repository of data
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent... (Amazon Web Services)
Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real world data flows with Pig, and understand the operational needs of a production data platform.
Building machine learning service in your business, Eric Chen (Uber) @PAPIs ... (PAPIs.io)
When making machine learning applications in Uber, we identified a sequence of common practices and painful procedures, and thus built a machine learning platform as a service. We here present the key components to build such a scalable and reliable machine learning service which serves both our online and offline data processing needs.
Data scientists and machine learning practitioners nowadays seem to be churning out models by the dozen, and they continuously experiment to find ways to improve their accuracy. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated bunch of assets that require different types of runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available for real-time applications without significant latencies? Different techniques are needed for batch (offline) inference and instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of data need to be enabled prior to any predictions. In many cases, there may be no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorizations built in, approval processes and still support a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are consumers of a model - so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes based offering for the Private Cloud and optimized for the HortonWorks Hadoop Data Platform. DSX essentially brings in typical software engineering development practices to Data Science, organizing the dev->test->production for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies and even rollback models & custom scorers as well as how API based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
An overview of building and serving Lucene indexes on a Hadoop cluster with Solr for text and parametric searching, as presented at Cleveland Hadoop User Group on 13 January 2014.
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ... (Spark Summit)
In this presentation, we are going to talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have been able to successfully use multiple micro-batch Spark Streaming pipelines to update and process information like product availability, pick up today, etc., along with updating our product catalog information in our search index, at up to 10,000 Kafka events per second in near real-time. Earlier, all the product catalog changes in the index had a 24-hour delay; using Spark Streaming, we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving end customers instant access to features like availability of a product, store pick up, etc.
Second, we have built a scalable anomaly detection framework purely using Spark Data Frames that is being used by our data pipelines to detect abnormality in search data. Anomaly detection is an important problem not only in the search domain but also many domains such as performance monitoring, fraud detection, etc. During this, we realized that not only are Spark DataFrames able to process information faster but also are more flexible to work with. One could write hive like queries, pig like code, UDFs, UDAFs, python like code etc. all at the same place very easily and can build DataFrame template which can be used and reused by multiple teams effectively. We believe that if implemented correctly Spark Data Frames can potentially replace hive/pig in big data space and have the potential of becoming unified data language.
We conclude that Spark Streaming and Data Frames are the key to processing extremely large streams of data in real-time with ease of use.
Don't optimize my queries, organize my data! (Julian Hyde)
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive databases, and how we infer and evolve the data model based on the queries.
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or Spark. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
4. Agenda
• Analytics at Inmobi - Problem areas and Motivation
• Why Apache Hive
• OLAP Model in Apache Hive
• Query examples
• Grill – Unified analytics
• Demo
5. Digital advertising at Inmobi
Courtesy: http://www.liesdamnedlies.com/
[Diagram labels: Owns & sells real estate on digital inventory; Has reach to users; Wants to target users; Brings money; Market place; Consumer]
6. Analytics Use cases
• Data scientists: understanding trends and inference; forecasting and anomaly detection
• Engineering systems: feedback to improve ad relevance in real time
• Developers: troubleshooting of issues
• Advertisers and publishers: publisher/advertiser-specific analytics (dashboards)
• Account managers/Executive team: tracking metrics
• Business/Product analysts: inventory sizing and estimation
7. • Canned/Dashboard queries
• Adhoc queries
• Interactive/Batch queries
• Scheduled queries
• Infer insights through ML algorithms
Categorize the use cases
8. Adhoc querying system Internal Dashboards
Customer facing
Dashboards and
Reporting
Analytics systems at Inmobi
9. Analytics Warehouses at Inmobi and Scale
• Billions of Ad Requests/Impressions per day
• 170 TB Hadoop Warehouse
• 5 TB SQL Columnar Datawarehouse
• 70 TB Hbase cluster
• DBMS
• Spark/Shark in near future
10. Why both Hadoop and SQL warehouse?
[Figure: canned and adhoc queries by query engine, compared on response times and IO (input records)]
Dashboard queries are mostly canned and interactive
Adhoc queries can be interactive or batch depending on the data volumes and query complexity
11. • Disparate user experience
• Disparate data storage and execution engines
• Schema management across storages
• Data discovery
• Not leveraging ‘SQL on Hadoop’ community
Problems
12. Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
13. Apache Hive to the rescue
What does Hive provide
• Associates structure to data
• Provides Metastore and catalog service – HCatalog
• Provides pluggable storage interface
• Accepts SQL-like queries
• HQL is a widely adopted language, used by systems like Shark and Impala
• Has a strong Apache community
What is missing in Hive
• Data warehouse features like cubes, facts, dimensions
• Logical table associated with multiple physical storages
• Pluggable execution engine
• Query lifecycle management
• Query quota management
• Scheduling queries
16. Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
17. Data Model
Cube Storage Fact Table
Physical
Fact tables
Dimension
Table
Physical
Dimension
tables
18. Data Model - Cube
Dimension
• Simple Dimension
• Referenced Dimension
• Hierarchical Dimension
• Expression Dimension
• Timed dimension
Measure
• Column Measure
• Expression Measure
Cube
Measures Dimensions
Note : Some of the concepts are borrowed from
http://community.pentaho.com/projects/mondrian/
19. Data Model – Storage
Storage
• Name
• End point
• Properties
• Ex : ProdCluster, StagingCluster, Postgres1,
HBase1, HBase2
20. Data Model – Fact Table
Fact
table
Cube
Fact
table
Storage
FactTable
• Columns
• Cube that it belongs to
• Storages on which it is present and the associated update periods
21. Data Model – Dimension table
DimensionTable
• Columns
• Dimension references
• Storages on which it is
present and associated
snapshot dump period, if
any.
Cube
Dimension
table
Dimension
table
Dimension
table
Storage
22. Data Model – Storage tables and partitions
Storagetable
• Belongs to fact/dimension
• Associated storage descriptor
• Partitioned by columns
• Naming convention – storage
name followed by
fact/dimension name
• Partition can override its
storage descriptor
• Fact storage table
Fact table
• Dimension storage table
Dimension table
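The data model on slides 19 to 22 can be illustrated with a small Python sketch. It is not Grill's metastore code: the class and field names are hypothetical, the update period labels "HOURLY" and "DAILY" are placeholders, and the underscore separator in storage table names is inferred from names such as c2_testfact in the example queries.

```python
from dataclasses import dataclass, field


@dataclass
class Storage:
    """A storage as described on slide 19: name, end point, properties."""
    name: str
    end_point: str
    properties: dict = field(default_factory=dict)


@dataclass
class FactTable:
    """A fact table: columns, owning cube, and update periods per storage."""
    name: str
    cube: str
    columns: list
    update_periods: dict  # storage name -> list of update period labels

    def storage_table_names(self):
        # Naming convention: storage name followed by the fact name.
        return [f"{storage}_{self.name}" for storage in self.update_periods]


# Hypothetical example data, using names from the example queries.
fact = FactTable(
    name="testfact",
    cube="testcube",
    columns=["cityid", "msr2", "dt"],
    update_periods={"c1": ["HOURLY", "DAILY"], "c2": ["HOURLY"]},
)
storage_tables = fact.storage_table_names()
```

A dimension table could be modelled the same way, with a snapshot dump period per storage in place of the update periods.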
26. Resolve candidate tables and storages
Automatically resolve joins
Resolve aggregates and groupby expressions
Allows SQL over Cube QL
Queries can span multiple storages
Accepts multi time range queries
All Hive QL features
Query features
27. Example query
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100
cube select name, stateid from citytable limit 100
28. Example query
• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube
INNER JOIN c1_citytable citytable ON ((testcube.cityid) = (citytable.id))
WHERE ((testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR
(testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR
(testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR
(testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR
(testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR
(testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR
(testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR
(testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12-01') OR
(testcube.dt='2014-03-12-02'))
AND (citytable.dt = 'latest')
GROUP BY (citytable.name)
cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03', '2014-03-12-03')
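The way timerange_in expands into partition filters on this slide can be sketched in Python. This is only an illustration, not Grill's implementation: the function names cover_time_range and to_where_clause are hypothetical, and the sketch assumes that only hourly and daily partitions exist, with a daily partition used whenever a whole day falls inside the range and the end of the range treated as exclusive.

```python
from datetime import datetime, timedelta

HOUR_FMT = "%Y-%m-%d-%H"
DAY_FMT = "%Y-%m-%d"


def cover_time_range(start, end):
    """Return partition values covering [start, end), preferring daily ones."""
    current = datetime.strptime(start, HOUR_FMT)
    stop = datetime.strptime(end, HOUR_FMT)
    parts = []
    while current < stop:
        next_day = current + timedelta(days=1)
        if current.hour == 0 and next_day <= stop:
            # A full day fits inside the range: use the daily partition.
            parts.append(current.strftime(DAY_FMT))
            current = next_day
        else:
            # Otherwise fall back to a single hourly partition.
            parts.append(current.strftime(HOUR_FMT))
            current += timedelta(hours=1)
    return parts


def to_where_clause(alias, column, parts):
    """Render the partition values as an OR-ed filter on one column."""
    return " OR ".join(f"({alias}.{column}='{p}')" for p in parts)


partitions = cover_time_range("2014-03-10-03", "2014-03-12-03")
where_clause = to_where_clause("testcube", "dt", partitions)
```

For the range in the example, this produces hourly partitions for the rest of 2014-03-10, the single daily partition 2014-03-11, and hourly partitions for the first three hours of 2014-03-12, which matches the filter in the rewritten query.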
29. Available in Hive
• Data warehouse features
like facts, dimensions
• Logical table associated
with multiple physical
storages
Available in Grill
• Pluggable execution engine for
HQL
• Query life cycle management
• Scheduling queries
Where is it available
30. Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
31. GRILL
Unify the catalog and query layer for adhoc/canned and batch/interactive reports on a single interface
33. Implements an interface
• explain
• execute
• executeAsynchronously
• fetchResults
• Specify all storages it can support
Pluggable execution engine
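The driver interface listed on this slide could look roughly like the following Python sketch. Grill itself is not written in Python, and the signatures, return types, and the InMemoryDriver class below are assumptions made for illustration; only the operation names come from the slide.

```python
import itertools
from abc import ABC, abstractmethod


class Driver(ABC):
    """Hypothetical shape of a pluggable execution engine."""

    @abstractmethod
    def supported_storages(self):
        """Return the set of storage names this engine can query."""

    @abstractmethod
    def explain(self, query):
        """Return a plan for the query, including an estimated cost."""

    @abstractmethod
    def execute(self, query):
        """Run the query and block until results are available."""

    @abstractmethod
    def execute_asynchronously(self, query):
        """Submit the query and return a handle for fetching results."""

    @abstractmethod
    def fetch_results(self, handle):
        """Return the results of a previously submitted query."""


class InMemoryDriver(Driver):
    """Toy driver that returns fixed rows, used only to exercise the interface."""

    def __init__(self, storages, rows):
        self._storages = set(storages)
        self._rows = list(rows)
        self._pending = {}
        self._handles = itertools.count(1)

    def supported_storages(self):
        return set(self._storages)

    def explain(self, query):
        # Made-up cost: the number of rows the toy driver would return.
        return {"query": query, "cost": float(len(self._rows))}

    def execute(self, query):
        return list(self._rows)

    def execute_asynchronously(self, query):
        handle = next(self._handles)
        self._pending[handle] = list(self._rows)
        return handle

    def fetch_results(self, handle):
        return self._pending.pop(handle)


driver = InMemoryDriver(["c1"], [("Bangalore", 42)])
handle = driver.execute_asynchronously("select name, msr2 from c1_testfact")
rows = driver.fetch_results(handle)
```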
34. Cube query with multiple execution engines
OLAP Cube QL query
Rewrite query for available execution engine's supported storages
Get cost of the rewritten query from each execution engine
Pick up execution engine with least cost and fire the query
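The selection flow on this slide (rewrite per engine, collect costs, pick the cheapest) can be sketched as follows. This is a simplified illustration under stated assumptions: the engine names, the FACT_STORAGES catalog, and the cost numbers are all made up, and the "rewrite" is reduced to choosing a physical table name.

```python
# Hypothetical catalog: the storages on which each fact table is present.
FACT_STORAGES = {"testfact": ["c1", "c2"]}


def rewrite_for_engine(fact, engine_storages):
    """Return a storage table the engine can read, or None if there is none."""
    for storage in FACT_STORAGES[fact]:
        if storage in engine_storages:
            return f"{storage}_{fact}"
    return None


def pick_engine(fact, engines):
    """Return (cost, engine name, storage table) for the cheapest engine."""
    candidates = []
    for engine in engines:
        table = rewrite_for_engine(fact, engine["storages"])
        if table is None:
            continue  # this engine supports none of the fact's storages
        cost = engine["estimate_cost"](table)
        candidates.append((cost, engine["name"], table))
    if not candidates:
        raise ValueError(f"no execution engine can answer a query on {fact}")
    return min(candidates, key=lambda candidate: candidate[0])


# Made-up engines and costs, purely for illustration.
engines = [
    {"name": "hive", "storages": {"c1", "c2"}, "estimate_cost": lambda t: 100.0},
    {"name": "jdbc", "storages": {"c2"}, "estimate_cost": lambda t: 10.0},
    {"name": "impala", "storages": {"c3"}, "estimate_cost": lambda t: 1.0},
]
best = pick_engine("testfact", engines)
```

Comparing raw costs across engines only makes sense if they are on a common scale, which is presumably why "Normalize query cost" appears on the roadmap slide.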
36. Grill – current state
Server
Query Service
Metastore Service
Metrics
Query statistics(In progress)
Scheduled queries(In
progress)
Query caching(In progress)
Client
Java Client
CLI
JDBC Client
Execution Engine
Hive Driver
JDBC Driver
Impala Driver
37. • Normalize query cost
• Load balancing across execution engines
• Alter meta hooks in StorageHandler
• Authentication and authorization
• Machine learning through Grill
• Query quota management
Grill roadmap
38. • Number of queries - 700 to 900 per day
• Number of dimension tables - 125
• Number of fact tables – 30
• Number of cubes – 22
• Size of the data
• Total size – 170 TB
• Dimension data – 400 MB compressed per hour
• Raw data - 1.2 TB per day
• Aggregated facts- 53GB per day
Data warehouse statistics
Inmobi provides a marketplace: it buys space on mobile from publishers and sells it to advertisers, and in the process it acquires users.
Adhoc querying system
Adhoc and Batch queries
Scheduled queries
Based on Hadoop Mapreduce
Provides UI and custom api
Data is stored in HDFS
Dashboard system
Canned reports
Interactive and adhoc queries
Provides UI and Custom api
Data is stored in columnar DWH
Customer facing system
Face to the outside world (Advertisers and publishers)
Interactive and adhoc queries
Provides UI and custom api
Data is stored in relational DB, Postgres
Inmobi has a 170 TB Hadoop warehouse and a 5 TB SQL warehouse. Let us see an example of a reporting page. This is the dashboard a publisher sees.
Conventional columnar RDBMS systems lend themselves well to interactive SQL queries over reasonably small datasets, on the order of tens to hundreds of GB, while Hadoop-based warehouses operate well over large datasets, on the order of TBs and PBs, and scale fairly linearly. Although storage structures in Hadoop warehouses have improved recently, for example with ORC, queries over Hadoop still typically adopt a full-scan approach. Choosing between these data stores based on cost of storage, concurrency, scalability and performance is fairly complex and not easy for most users.
Individually, all the systems we just saw work really well. They provide the best response times for user queries.
Disparate user experience because of multiple reporting systems
Involves a learning curve for the systems and their APIs
Disparate data storage systems, causing inability to scale
Altering a schema involves different systems
Data discovery
Cannot leverage data in other systems
Not leveraging the community around 'SQL on Hadoop'
Cannot experiment with a new storage/execution engine out of the box
Column Measure : name, type, default aggregate, format string, start date, end date
Expression Measure : Associated Expression
Simple Dimension: name, type, start date, end date
Referenced Dimension : Referencing table and column
Hierarchical Dimension : Hierarchy
Expression Dimension : Associated expression
The grammar is a subset of HQL
Resolve candidate dimension tables and their storage tables.
Resolve the candidate fact tables which can answer the query, and pick the ones from the top of the pyramid.
Resolve fact storage tables for the queried time range.
Automatically resolve joins using the relationships between cubes and dimensions.
Automatically add aggregate functions to measures.
Add an expression to the group by clause if it is projected, and project a group by expression if it is not already projected.
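The last two rewrite steps (adding default aggregates to measures and grouping by the projected dimensions) can be illustrated with a short Python sketch. It is not the Grill rewriter: rewrite_select is a hypothetical helper, the default aggregate "sum" for msr2 is taken from the example query on slide 28, and join and partition resolution are left out.

```python
def rewrite_select(projected, default_aggregates, table):
    """Wrap measures in their default aggregate and group by the other columns.

    projected: list of column expressions in the cube query's select list.
    default_aggregates: maps a measure to its default aggregate function name.
    table: the already-resolved FROM clause (storage table plus alias).
    """
    select_items = []
    group_by = []
    for column in projected:
        if column in default_aggregates:
            # Measures get their default aggregate function.
            select_items.append(f"{default_aggregates[column]}({column})")
        else:
            # Projected dimensions are added to the group by clause.
            select_items.append(column)
            group_by.append(column)
    sql = f"SELECT {', '.join(select_items)} FROM {table}"
    has_measure = len(group_by) < len(projected)
    if has_measure and group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


rewritten = rewrite_select(
    ["citytable.name", "testcube.msr2"],
    {"testcube.msr2": "sum"},
    "c2_testfact testcube",
)
```

With no measure in the select list, as in the citytable query on slide 27, the sketch emits no GROUP BY clause.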