@ScalaByTheBay conference talk description: In this talk, we’ll discuss how we define and query cubes across multiple data stores for reporting purposes. With a single definition, we can decide at query time which table and data source is best suited to answer a given request. That decision must take into account time zone conversion, data availability, supported fact/dim-based operations, request granularity, defined constraints, and the time range of the request. Ultimately, the request is answered using Hive, an RDBMS, or Druid. This lets us take advantage of the performance characteristics of each data store while exposing a single interface for querying. Our goal isn’t to create a unified SQL layer that can query multiple data stores. Our goal is to define a single view of the data in which we can define post-aggregates or other derived expressions, which can later be used to programmatically generate a query for the target data store.
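To make the query-time decision concrete, here is a minimal sketch of filtering candidate sources by the kinds of constraints the description lists (available columns, data availability, supported dimension operations, time range). Every name in it (`Source`, `Request`, `retention_days`, the column sets, the preference order) is an assumption made for illustration and is not taken from the talk.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Source:
    """One physical table that can serve the cube (all fields hypothetical)."""
    engine: str               # e.g. "druid", "oracle", "hive"
    columns: frozenset        # columns this table can answer
    retention_days: int       # how far back data is available
    supports_dim_joins: bool  # whether dimension joins are supported


@dataclass(frozen=True)
class Request:
    columns: frozenset
    start: date
    end: date
    needs_dim_join: bool


def eligible(source, request, today):
    """A source qualifies only if it passes every constraint."""
    oldest = today - timedelta(days=source.retention_days)
    return (
        request.columns <= source.columns
        and request.start >= oldest
        and request.end <= today
        and (source.supports_dim_joins or not request.needs_dim_join)
    )


def pick_source(sources, request, today):
    """Return the first eligible source in preference order, or None."""
    for source in sources:
        if eligible(source, request, today):
            return source
    return None


# Assumed preference order: fastest store first, widest coverage last.
SOURCES = [
    Source("druid", frozenset({"campaign_id", "advertiser_id", "spend"}), 30, False),
    Source("oracle", frozenset({"campaign_id", "advertiser_id", "spend",
                                "campaign_name"}), 400, True),
    Source("hive", frozenset({"ad_id", "campaign_id", "advertiser_id", "spend",
                              "campaign_name"}), 3650, True),
]
```

In this sketch a short, recent request is served by the first store that can answer it, and older or wider requests fall through to stores with longer retention or more columns.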
2. Who am I?
● Sr Principal Architect/Director of Engineering at Yahoo - Gemini Reporting
● Big Data at GridX, Klout, eBay/Shopping.com, Ask.com, and HP using Hadoop/HBase/Hive/Pig/Oozie/Ab Initio/Oracle/DB2
3. Agenda
● What and Why
● Evolving Query Generation
● Why not Kylin or Lens?
● Results
● What’s Next?
4. What kind of query?
● Data Warehouse / OLAP Queries
○ Star Schema
■ Dimensions - reference for measures, denormalized
■ Facts - measures
○ Snowflake Schema
■ Normalized dimensions
5. What do we mean by OLAP Cube?
● OLAP cube is a method of storing data in a multidimensional form that is optimized for reporting queries across dimensions
Table Name      Columns
ad_stats        ad_id, ad_grp_id, campaign_id, advertiser_id, spend
ad_grp_stats    ad_grp_id, campaign_id, advertiser_id, spend
campaign_stats  campaign_id, advertiser_id, spend
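The three tables above hold the same measure at different levels of rollup, so a request can often be answered from more than one of them. A minimal sketch of choosing among them follows. It assumes, beyond what the slide states, that a table with fewer key columns is more aggregated and therefore cheaper to scan; `coarsest_table` is a hypothetical helper name.

```python
# Column lists follow the table on the slide (with column-name typos corrected).
TABLES = {
    "ad_stats": ["ad_id", "ad_grp_id", "campaign_id", "advertiser_id", "spend"],
    "ad_grp_stats": ["ad_grp_id", "campaign_id", "advertiser_id", "spend"],
    "campaign_stats": ["campaign_id", "advertiser_id", "spend"],
}


def coarsest_table(requested):
    """Pick the table with the fewest columns that still covers the request."""
    wanted = set(requested)
    candidates = [
        (len(columns), name)
        for name, columns in TABLES.items()
        if wanted <= set(columns)
    ]
    if not candidates:
        raise ValueError(f"no table can answer columns: {sorted(wanted)}")
    # min() compares on column count first, then on name as a tie-breaker.
    return min(candidates)[1]
```

A request for `campaign_id` and `spend` resolves to `campaign_stats`, while adding `ad_id` forces the most detailed table, `ad_stats`.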
7. Why query generation?
● Centralize reporting system
● Multiple use cases
● Simple interface
8. What do you choose?
● Druid
● Apache Spark
● PrestoDB
● Apache Drill
● Kudu
● Impala
● Big Query
● MemSQL
● Redshift/ParAccel
● Vertica
● Netezza/IBM
● Greenplum
● Teradata
● Exadata/Oracle RAC
9. Why multiple data stores?
● Evolving Technology
○ Start simple
○ Scale the Business
○ Use the right tool for the job
○ Mixture of vertical and horizontal scaling
○ Support incremental migration
○ Cost of migration
16. Evolving Query Generation - Take 2
● SQL construction by inspecting definitions
● Easier to optimize query at construction time
● Engine specific SQL
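As an illustration of engine-specific SQL built from one definition, the sketch below renders the same logical request for two engines. The function name `build_sql`, the `stats_date` column, and the choice of dialect differences (date literal and row limiting) are assumptions for this example, not the generator described in the talk. Values are interpolated directly only to keep the sketch short; real code would use bind parameters.

```python
def build_sql(engine, table, dims, metrics, day, limit):
    """Render one aggregate query in the dialect of the target engine."""
    select_list = dims + [f"SUM({m}) AS {m}" for m in metrics]

    # Dialect difference 1: how a day filter is written.
    if engine == "oracle":
        day_filter = f"stats_date = TO_DATE('{day}', 'YYYY-MM-DD')"
    elif engine == "hive":
        day_filter = f"stats_date = '{day}'"
    else:
        raise ValueError(f"unsupported engine: {engine}")

    core = (
        f"SELECT {', '.join(select_list)} FROM {table} "
        f"WHERE {day_filter} GROUP BY {', '.join(dims)}"
    )

    # Dialect difference 2: how the row count is limited.
    if engine == "oracle":
        return f"SELECT * FROM ({core}) WHERE ROWNUM <= {limit}"
    return f"{core} LIMIT {limit}"
```

Keeping the dialect branches inside the generator means the caller describes the request once and never writes engine-specific SQL by hand.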
17. Evolving Query Generation - Take 2
● Challenges
○ No intelligence for selecting a data store beyond available columns
○ Difficult to extend
○ Annotations promoted arbitrary special casing, duplication
21. Evolving Query Generation - Take 3
● Easier to add new data stores/engines through generalization and better separation of concerns
22. Evolving Query Generation - Take 3
● Cost based engine selection with pluggable cost estimators
○ Dimension cost - due to join cardinality
○ Fact cost - due to number of rows scanned
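A minimal sketch of what pluggable cost estimators could look like: each engine registers a function that combines a fact cost (rows scanned) with a dimension cost (join cardinality), and the cheapest engine wins. The weights and function names below are invented for illustration and do not reflect measured costs or the actual estimators.

```python
def rdbms_cost(rows_scanned, join_cardinality):
    # Assumption: scanning fact rows dominates, dimension joins are cheap.
    return rows_scanned * 1.0 + join_cardinality * 0.1


def druid_cost(rows_scanned, join_cardinality):
    # Assumption: scans are cheap, high-cardinality dimension lookups are costly.
    return rows_scanned * 0.01 + join_cardinality * 5.0


# Pluggable registry: adding an engine means adding one entry here.
ESTIMATORS = {"oracle": rdbms_cost, "druid": druid_cost}


def cheapest_engine(rows_scanned, join_cardinality, estimators=ESTIMATORS):
    """Return (engine, per-engine costs) for the lowest estimated cost."""
    costs = {
        engine: estimate(rows_scanned, join_cardinality)
        for engine, estimate in estimators.items()
    }
    return min(costs, key=costs.get), costs
```

With these made-up weights, a large scan with a small join goes to Druid, while a small scan joined against a high-cardinality dimension goes to the RDBMS.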
23. Evolving Query Generation - Take 3
● Partitioning aware definitions with pluggable partitioning scheme
24. Evolving Query Generation - Take 3
● Versioning of cube definitions
● Bucket testing of new definitions
○ User list
○ Internal users
○ Dry run
○ External users
● Timezone aware definitions with pluggable time provider
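One way to read "pluggable time provider" is a function that turns a requested calendar day into the UTC interval the data store should scan. The sketch below is written under that assumption. It uses fixed UTC offsets to stay self-contained; a real provider would need named time zones with daylight-saving rules, and `fixed_offset_provider` is a hypothetical name.

```python
from datetime import datetime, time, timedelta, timezone


def fixed_offset_provider(offset_hours):
    """Build a provider that maps a local day to a half-open UTC interval."""
    local_zone = timezone(timedelta(hours=offset_hours))

    def day_to_utc(day):
        # Midnight at the start of the requested day, in the local zone.
        start_local = datetime.combine(day, time.min, tzinfo=local_zone)
        start_utc = start_local.astimezone(timezone.utc)
        # Fixed offsets have no DST shifts, so every day is exactly 24 hours.
        return start_utc, start_utc + timedelta(days=1)

    return day_to_utc
```

A query generator can then accept any such provider and filter on the returned UTC bounds, regardless of the time zone the request was made in.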
26. Evolving Query Generation - Take 3
● Querying across multiple engines
27. Why not Kylin or Lens?
● Kylin - end to end product for managing your OLAP needs, ANSI-SQL
● Lens - manages definitions and query lifecycle, Cube QL
28. Why not Kylin or Lens?
● Library/Framework approach built upon Star/Snowflake schema data model
● Easy to customize and optimize generators
● Simple JSON interface
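The slides do not show the JSON interface itself, so the request shape below (`cube` and `fields` keys) is purely hypothetical. The sketch only illustrates how a small JSON request could be validated against a cube definition before any query is generated.

```python
import json

# Hypothetical cube definitions: cube name -> columns it exposes.
CUBE_COLUMNS = {
    "campaign_stats": {"campaign_id", "advertiser_id", "spend"},
}


def parse_request(raw):
    """Parse a JSON request and reject unknown cubes or fields early."""
    body = json.loads(raw)
    cube = body["cube"]
    if cube not in CUBE_COLUMNS:
        raise ValueError(f"unknown cube: {cube}")
    fields = list(body["fields"])
    unknown = sorted(set(fields) - CUBE_COLUMNS[cube])
    if unknown:
        raise ValueError(f"unknown fields for {cube}: {unknown}")
    return cube, fields
```

Rejecting a bad request at this stage keeps the error message close to the caller's input rather than surfacing as a failure inside a data store.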
31. Evolving Query Generation - Results
● Millions of OLAP queries per day
● 30+ cube definitions across 3 data stores (Hive, Oracle, Druid)
● Current query generation is 3x faster than previous version
● 20% less code, more features, better validation and error handling
33. Future Work
● Add Fact/Dim view support
● Add Fact/Fact join support
● Resource availability for engine selection
● Data availability for engine selection
● Open source (should we?)