The document discusses Apache Kylin, an open source distributed analytics engine that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets. It gives an overview of Kylin's features, such as sub-second query latency, ANSI SQL support, and seamless integration with BI tools. The document also covers Kylin's architecture, cube storage in HBase, query processing using Calcite, and optimization techniques for cube building.
Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets with sub-second query latency.
Apache Kylin 2.0: From Classic OLAP to Real-Time Data Warehouse (Yang Li)
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
Apache Kylin - OLAP Cubes for SQL on Hadoop (Ted Dunning)
Apache Kylin (incubating) is a new project to bring OLAP cubes to Hadoop. I walk through the project and describe how it works and how users see the project.
During Kylin OLAP development, we set up many engineering principles in the team. These principles are very important to delivering Kylin with high quality and on schedule.
Experimentation plays a vital role in business growth at eBay by providing valuable insights and predictions on how users will react to changes made to the eBay website and applications. On a given day, eBay has several hundred experiments running at the same time. Our experimentation data processing pipeline handles billions of rows of user behavioral and transactional data per day to generate detailed reports covering 100+ metrics over 50 dimensions.
In this session, we will share the journey of how we moved this complex process from a data warehouse to Hadoop. We will give an overview of the experimentation platform and data processing pipeline, highlight the challenges and learnings we faced implementing this platform on Hadoop, and show how this transformation led us to build a scalable, flexible, and reliable data processing workflow in Hadoop. We will cover our work on performance optimizations, methods to establish resilience and configurability, efficient storage formats, and the choice of frameworks used in the pipeline.
Apache Kylin on HBase: Extreme OLAP Engine for Big Data (Shi Shao Feng)
Apache Kylin is an open source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark, supporting extremely large datasets.
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
Kylin Open Source Web Site: http://kylin.io
Apache Kylin’s Performance Boost from Apache HBase (HBaseCon)
Hongbin Ma and Luke Han (Kyligence)
Apache Kylin is an open source distributed analytics engine that provides a SQL interface and multi-dimensional analysis on Hadoop supporting extremely large datasets. In the forthcoming Kylin release, we optimized query performance by exploring the potentials of parallel storage on top of HBase. This talk explains how that work was done.
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive (Xu Jiang)
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
If you want to do multi-dimensional analysis on large datasets (billion+ rows) with low query latency (sub-second), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g., Tableau).
This presentation contains the following slides:
Introduction To OLAP
Data Warehousing Architecture
The OLAP Cube
OLTP vs. OLAP
Types Of OLAP
ROLAP vs. MOLAP
Benefits Of OLAP
Introduction - Apache Kylin
Kylin - Architecture
Kylin - Advantages and Limitations
Introduction - Druid
Druid - Architecture
Druid vs Apache Kylin
References
For any queries, contact us: argonauts007@gmail.com
Talks about best practices and patterns for designing an efficient cube in Kylin. Covers concepts like mandatory dimensions, hierarchy dimensions, derived dimensions, incremental build, and aggregation groups.
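To make the pruning effect of these cube-design concepts concrete, here is a back-of-the-envelope sketch. The dimension counts are made up, and the formulas are the standard cuboid-counting rules for these concepts, not Kylin's exact planner:

```python
def count_cuboids(n_dims):
    # A full cube materializes every subset of n dimensions: 2^n cuboids.
    return 2 ** n_dims

def count_with_mandatory(n_dims, n_mandatory):
    # A mandatory dimension appears in every query, so only cuboids that
    # include all mandatory dimensions are built: 2^(n - m).
    return 2 ** (n_dims - n_mandatory)

def count_with_hierarchy(n_dims, hierarchy_len):
    # A hierarchy (e.g. year > month > day) is only queried as a prefix,
    # so its 2^h subsets collapse to h + 1 valid prefixes.
    return (hierarchy_len + 1) * 2 ** (n_dims - hierarchy_len)

print(count_cuboids(10))            # 1024 cuboids for a naive 10-dimension cube
print(count_with_mandatory(10, 2))  # 256 with 2 mandatory dimensions
print(count_with_hierarchy(10, 3))  # 512 with one 3-level hierarchy
```

Combining several such constraints (plus aggregation groups) is what keeps cube size and build time manageable as dimensions grow.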
A general introduction to Apache Kylin, including background, business needs and technical challenges, theory and architecture, features, and some technical detail, followed by performance and benchmarks, and finally the ecosystem and roadmap.
For more detail, please visit http://kylin.io or follow @ApacheKylin.
Fraud Detection in Real-Time @ Apache Big Data Con (Seshika Fernando)
Here is the slide deck I used for my talk at #apachebigdata con.
Download whitepaper: http://wso2.com/whitepapers/fraud-detection-and-prevention-a-data-analytics-approach/
Being Ready for Apache Kafka - Apache: Big Data Europe 2015 (Michael Noll)
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka, this talk will provide an update on the growth and status of the Kafka project community. The rest of the talk will focus on walking the audience through what's required to put Kafka in production. We’ll give an overview of the current Kafka ecosystem, including client libraries for creating your own apps, operational tools, and peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
We use "SaasBase Analytics" to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in real time. The requirement we started from was to make large amounts of data available in near real time (minutes) to large numbers of users for large numbers of (different) queries that take milliseconds to execute. This set our problem apart from classical solutions such as Hive and Pig. In this talk I'll go through the design of the solution and the strategies (and hacks) used to achieve low latency and scalability, from the theoretical model through the entire process of ETL to warehousing and queries.
IBM Message Hub Service in Bluemix - Apache Kafka in a Public Cloud (Andrew Schofield)
This talk was presented at the Kafka Meetup London meeting on 20 January 2016. You can find more information about Message Hub here: http://ibm.biz/message-hub-bluemix-catalog
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1FQYcP0.
Gian Merlino presents the advantages, challenges, and best practices to deploying and maintaining lambda architectures in the real world, using the infrastructure at Metamarkets as a case study. Filmed at qconsf.com.
Gian Merlino is a senior software engineer at Metamarkets, responsible for the infrastructure behind its data ingestion pipelines and is a committer on the Druid project.
Online Fraud Detection Using Big Data Analytics Webinar (Datameer)
With the ease and convenience of the internet, shopping online has never been faster and simpler than with the click of a button. But with this convenience lurks the risk of online fraud.
Companies and merchants lose valuable time and money to online thieves scamming the web. Learn how to identify patterns with Datameer and Trustev as they demonstrate how to take control of the situation and combat suspicious activity using big data analytics.
In this webinar, you will take away:
*An understanding of the complexities and challenges of online fraud today
*Best practices for merchants and companies to protect themselves from fraud
*A demonstration of fraud reporting, prevention and prediction
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop (HBaseCon)
Kylin is an open source distributed analytics engine contributed by eBay that provides a SQL interface and OLAP on Hadoop, supporting extremely large datasets. Kylin's pre-built MOLAP cubes (stored in HBase), distributed architecture, and high concurrency help users run multidimensional queries via SQL and other BI tools. During this session, you'll learn how Kylin uses HBase's key-value store to serve SQL queries over a relational schema.
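As an illustration of the key-value idea, here is a simplified sketch. This is not Kylin's actual rowkey format; the cuboid id, dictionary codes, and measure names are invented for the example:

```python
# A plain dict stands in for an HBase table of pre-aggregated cube rows.
cube = {}

def rowkey(cuboid_id, dim_codes):
    # Fixed-width big-endian encoding keeps keys sortable for range scans.
    return bytes([cuboid_id]) + b"".join(c.to_bytes(2, "big") for c in dim_codes)

# Build time: store pre-aggregated measures for a hypothetical cuboid 3,
# say (date, country). Dimension values are dictionary-encoded to small
# integers beforehand (e.g. "US" -> 44).
cube[rowkey(3, [17, 44])] = {"sum_price": 1234.5, "count": 87}

# Query time: "SELECT SUM(price) WHERE date = d17 AND country = c44"
# becomes a single point lookup instead of a scan over billions of raw rows.
row = cube[rowkey(3, [17, 44])]
print(row["sum_price"])  # 1234.5
```

The real engine adds measure serialization, range scans over key prefixes, and coprocessor-side aggregation, but the point-lookup structure is the core of why cube queries stay sub-second.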
Cloud-Native Semantic Layer on Data Lake (Databricks)
With larger volumes of data, and more real-time data, stored in the data lake, it becomes more complex to manage that data and serve analytics and applications. With different service interfaces, data calibers, and performance biases across scenarios, business users begin to lose confidence in the quality and efficiency of getting insights from data.
Accelerating Big Data Analytics with Apache Kylin (Tyler Wishnoff)
Learn about the latest advancements in Apache Kylin and how its OLAP technology is making analytics faster and insights more actionable.
Learn more about Apache Kylin: https://kyligence.io/apache-kylin-overview/
Learn more about Apache Kylin's enterprise version Kyligence: https://kyligence.io/
Self-Serve Analytics Journey at Celtra: Snowflake, Spark, and Databricks (Grega Kespret)
Celtra provides a platform for streamlined ad creation and campaign management used by customers including Porsche, Taco Bell, and Fox to create, track, and analyze their digital display advertising. Celtra’s platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Celtra’s Grega Kešpret leads a technical dive into Celtra’s data-pipeline challenges and explains how it solved them by combining Snowflake’s cloud data warehouse with Spark to get the best of both.
Topics include:
- Why Celtra changed its pipeline, materializing session representations to eliminate the need to rerun its pipeline
- How and why it decided to use Snowflake rather than an alternative data warehouse or a home-grown custom solution
- How Snowflake complemented the existing Spark environment with the ability to store and analyze deeply nested data with full consistency
- How Snowflake + Spark enables production and ad hoc analytics on a single repository of data
When OLAP Meets Real-Time, What Happens in eBay? (DataWorks Summit)
An OLAP cube is about pre-aggregation: it reduces query latency by spending more time and resources on data preparation. But for real-time analytics, data preparation and visibility latency are critical. What happens when the OLAP cube meets real-time use cases?
Can we pre-build the cubes in real time in a quick and more cost-effective way? This is hard, but still doable.
At eBay, we built our own real-time OLAP solution based on Apache Kylin and Apache Kafka. We read unbounded events from the Kafka cluster, then divide the streaming data into three stages: an in-memory stage (continuous in-memory aggregation), an on-disk stage (flushed to disk, with columnar storage and indexes), and a full cubing stage (built with MapReduce or Spark and saved to HBase). Data is aggregated into different layers in each stage, but remains queryable throughout; it moves from one stage to the next automatically and transparently to the user.
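The in-memory stage described here can be sketched roughly as follows. The event shape, group-by key, and flush threshold are assumptions for illustration, not eBay's implementation:

```python
from collections import defaultdict

class InMemoryStage:
    """Continuously aggregate unbounded events, flushing to the next
    (on-disk) stage once the aggregate map grows past a threshold."""

    def __init__(self, flush_threshold=2):
        self.aggs = defaultdict(lambda: {"count": 0, "sum": 0.0})
        self.flush_threshold = flush_threshold
        self.flushed = []  # stands in for the on-disk stage

    def consume(self, event):
        key = (event["dim"],)  # group-by key, e.g. country
        cell = self.aggs[key]
        cell["count"] += 1
        cell["sum"] += event["value"]
        if len(self.aggs) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Hand the partial aggregates off; both flushed and in-memory
        # data stay queryable, mirroring the multi-stage design above.
        self.flushed.append(dict(self.aggs))
        self.aggs.clear()

stage = InMemoryStage()
for e in [{"dim": "US", "value": 1.0}, {"dim": "UK", "value": 2.0}]:
    stage.consume(e)
print(stage.flushed)  # one flushed batch with aggregates for US and UK
```

A real implementation would also checkpoint Kafka offsets and merge flushed fragments into the full cube, but the aggregate-then-flush loop is the essence of the first stage.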
This solution was built to support quite a few real-time analytics use cases at eBay; in this session we will share some of them, such as site speed monitoring and eBay site deal performance.
Speaker:
Qiaoneng Qian, Senior Product Manager, eBay
How We Evolved the Data Pipeline at Celtra and What We Learned Along the Way (Grega Kespret)
Presented at Data Science Meetup on 4/12/2018.
In this talk, Grega Kespret (head of the analytics group) will present Celtra’s data analytics pipeline and how it evolved through the years - sometimes forward, sometimes backward. On this journey, we became early adopters of different technologies: BigQuery, Vertica (pre-join projections), Spark (version 0.5), Databricks (beta users), and Snowflake (one of the first users). As the business grew and the product evolved, the volume and complexity of data increased ten-fold, as did the number of users generating insights from this data. How come BigQuery did not scale? Why was choosing Vertica a mistake for our use case, and what have we learned from it? What requirements did we have for the analytics database, why did we have to abandon MySQL, and why did we finally choose Snowflake? This talk will be heavily opinionated and will describe our experience and learnings - what worked for us and what didn't.
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ... (Amazon Web Services)
Analyze Big Data for Consumer Applications with Looker BI and Amazon Redshift. Customizing the customer experience based on user behavior is a constant challenge for today’s consumer apps. Business intelligence helps analyze and model large amounts of data. Looker offers a modern approach to BI, leveraging AWS, that’s fast, agile, and easy to manage. Join this webinar to learn how MessageMe, which provides emotionally engaging messaging apps to consumers, leverages Looker business intelligence software and the Amazon Redshift data warehouse service to analyze billions of rows of customer data in seconds.
Webinar topics include:
• How MessageMe turns billions of rows of customer data stored in Amazon Redshift into actionable insights
• How Looker connects directly to Amazon Redshift in just a few clicks, enabling MessageMe to build modern big data analytics in the cloud.
Who should attend:
• Information or solution architects, data analysts, BI directors, DBAs, development leads, developers, or technical IT leaders.
Presenters:
• Justin Rosenthal, CTO, MessageMe
• Keenan Rice, VP, Marketing & Alliances, Looker
• Tina Adams, Senior Product Manager, AWS
Apache CarbonData + Spark to Realize Data Convergence and Unified High Performa... (Tech Triveni)
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally; moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need different pipelines for data and application management.
Google BigQuery Is the Future of Analytics! (Google Developer Conference, Rasel Rana)
Google Developer Group (GDG) Sonargaon is a community based focused group for developers on Google and related technologies. I tried to cover a topic on Big Data & BigQuery which is the future of analytics.
Creating a Modern Data Architecture for Digital Transformation (MongoDB)
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Enterprise Data World 2018 - Building a Cloud Self-Service Analytical Solution (Dmitry Anoshin)
This session will cover building the modern Data Warehouse by migration from the traditional DW platform into the cloud, using Amazon Redshift and Cloud ETL Matillion in order to provide Self-Service BI for the business audience. This topic will cover the technical migration path of DW with PL/SQL ETL to the Amazon Redshift via Matillion ETL, with a detailed comparison of modern ETL tools. Moreover, this talk will be focusing on working backward through the process, i.e. starting from the business audience and their needs that drive changes in the old DW. Finally, this talk will cover the idea of self-service BI, and the author will share a step-by-step plan for building an efficient self-service environment using modern BI platform Tableau.
Similar to Apache Kylin @ Big Data Europe 2015
3. Big Data @ eBay
$28B GMV via mobile (2014)
266M mobile app downloads globally
1B listings created via mobile
157M active buyers
25M active sellers
800M active listings
8.8M new listings every week
6. Extreme OLAP Engine for Big Data
Apache Kylin is an open source Distributed Analytics Engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets
What's Kylin
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014
7. Business Needs for Big Data Analysis
Sub-second query latency on billions of rows
ANSI SQL for both analysts and engineers
Full OLAP capability to offer advanced functionality
Seamless integration with BI tools
Support for high cardinality and high dimensionality
High concurrency: thousands of end users
Distributed and scale-out architecture for large data volumes
8. Extremely Fast OLAP Engine at Scale
Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
ANSI SQL Interface on Hadoop
Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
Seamless Integration with BI Tools
Kylin currently offers integration with BI tools such as Tableau
Interactive Query Capability
Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries on the same dataset
MOLAP Cube
Users can define a data model and pre-build it in Kylin over 10+ billion raw data records
Feature Highlights
9. Apache Kylin Adoption
External
25+ contributors in the community
Adoption:
In production: eBay, Baidu Maps
Evaluation: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau…
eBay Use Cases
Case                   Cube Size  Raw Records
User Session Analysis  14+ TB     75+ billion rows
Traffic Analysis       21+ TB     20+ billion rows
Behavior Analysis      560+ GB    1.2+ billion rows
14. Kylin Architecture Overview
(Architecture diagram, summarized)
Online analysis data flow: third-party apps (web app, mobile…) and SQL-based tools (BI tools such as Tableau…) connect through the REST API or the JDBC/ODBC drivers. The REST server routes SQL to the Query Engine, which serves low-latency (seconds) queries from the OLAP cubes stored in HBase as key-value data.
Offline data flow: the Cube Builder (MapReduce…) reads star-schema data from Hive on Hadoop and, driven by the cube metadata, builds the cubes.
Clients/users interact with Kylin via SQL; the OLAP cube is transparent to users.
Pluggable layers: data source abstraction, engine abstraction, and storage abstraction.
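As a concrete illustration of the online flow, a client can submit SQL to the REST server. The sketch below builds such a request with only the Python standard library; the endpoint path and the default ADMIN/KYLIN credentials follow Kylin's documented REST API, while the host, project, and SQL text are placeholders for your own deployment.

```python
import base64
import json
import urllib.request

def kylin_query_request(sql, project, host="http://localhost:7070",
                        user="ADMIN", password="KYLIN"):
    """Build a POST request for Kylin's REST query endpoint.

    The payload and endpoint follow the Kylin REST API docs; adjust
    host/user/password for your deployment.
    """
    body = json.dumps({"sql": sql, "project": project}).encode("utf-8")
    req = urllib.request.Request(host + "/kylin/api/query",
                                 data=body, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("Content-Type", "application/json")
    return req

# Sending it requires a running Kylin instance, e.g.:
# with urllib.request.urlopen(kylin_query_request(
#         "SELECT site, SUM(gmv) FROM sales GROUP BY site", "my_project")) as resp:
#     rows = json.load(resp)["results"]
```

The same query could equally go through the JDBC/ODBC drivers; the REST form is shown because it needs no extra client libraries.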
16. OLAP Cube – Balance between Space and Time
Cuboid lattice for a 4-dimension cube (time, item, location, supplier):
0-D (apex) cuboid: ()
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
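The lattice can be enumerated mechanically: a cuboid is just a subset of the dimension set, so the full cube is the powerset. A minimal Python sketch:

```python
from itertools import combinations

def cuboids(dimensions):
    """Enumerate every cuboid of a cube: one per subset of dimensions,
    from the 0-D apex cuboid () up to the n-D base cuboid."""
    for k in range(len(dimensions) + 1):
        for combo in combinations(dimensions, k):
            yield combo

dims = ("time", "item", "location", "supplier")
lattice = list(cuboids(dims))
# 2^4 = 16 cuboids: 1 apex + 4 one-D + 6 two-D + 4 three-D + 1 base
```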
17. Calcite: a dynamic data management framework.
Formerly known as Optiq, Calcite is an Apache incubator project used by Apache Drill and Apache Hive, among others.
http://optiq.incubator.apache.org
Querying the Cube
18. • Metadata SPI
– Provides table schemas from Kylin metadata
• Optimization Rules
– Translate logical operators into Kylin operators
• Relational Operators
– Find the right cube
– Translate SQL into storage engine API calls
– Generate the physical execution plan via the linq4j Java implementation
• Result Enumerator
– Translates storage engine results into Java implementation results
• SQL Functions
– Add HyperLogLog for distinct count
– Implement date/time-related functions (e.g., QUARTER)
Query Engine based on Calcite
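The HyperLogLog-based approximate distinct count can be illustrated with a small sketch. This is a textbook-style implementation for intuition only; Kylin's actual counter differs in hashing, register layout, and serialization.

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counting with m = 2^p registers (textbook HLL)."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        h = int(hashlib.sha1(str(value).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leftmost 1-bit position
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:                       # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
# hll.count() lands within a few percent of the true cardinality of 100,000
```

With p=10 (1,024 registers) the standard error is roughly 1.04/sqrt(1024), about 3%, which is why a few kilobytes per cube cell suffice for distinct-count measures.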
19. Full Cube
Pre-aggregates all dimension combinations
"Curse of dimensionality": an N-dimension cube has 2^N cuboids
Partial Cubes
To avoid dimension explosion, we divide the dimensions into different aggregation groups:
2^(N+M+L) -> 2^N + 2^M + 2^L
For a cube with 30 dimensions divided into 3 groups of 10, the cuboid count drops from about 1 billion to about 3 thousand:
2^30 -> 2^10 + 2^10 + 2^10
Tradeoff between online aggregation and offline pre-aggregation
Optimizing Cube Builds – Full Cube vs. Partial Cube
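The arithmetic behind the aggregation-group savings is simple enough to check directly:

```python
def full_cube_cuboids(n_dims):
    """A full cube materializes one cuboid per subset of dimensions: 2^N."""
    return 2 ** n_dims

def partial_cube_cuboids(group_sizes):
    """With aggregation groups, only combinations inside each group are
    pre-built, so 2^(N+M+L) collapses to 2^N + 2^M + 2^L."""
    return sum(2 ** g for g in group_sizes)

print(full_cube_cuboids(30))               # 1073741824, about 1 billion
print(partial_cube_cuboids([10, 10, 10]))  # 3072, about 3 thousand
```

Queries that cross group boundaries fall back to aggregating at query time, which is the space/time tradeoff the slide describes.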
22. Optimizing Cube Builds: Improved Parallelism
• The current algorithm (layer by layer)
– Many MR rounds: one per layer, i.e., as many as the number of dimensions
– Huge shuffles with aggregation on the reduce side, about 100x the total cube size
(Diagram: Full Data -> 4-D cuboid -> 3-D cuboid -> 2-D cuboid -> 1-D cuboid -> 0-D cuboid, with a separate MR job between each step)
23. Optimizing Cube Builds: Improved Parallelism
• The to-be algorithm, 30%-50% faster
– A single round of MR
– Reduced shuffles thanks to map-side aggregation, about 20x the total cube size
– Hourly incremental builds done in tens of minutes
(Diagram: each mapper turns a data split into a cube segment; the segments are combined via merge sort in the shuffle into the final cube)
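The shuffle reduction comes from aggregating inside the mapper before anything is emitted. A toy sketch of the idea in Python (the dimension names, the SUM(price) measure, and the in-memory dict stand in for Kylin's real in-mapper cubing and combiner machinery):

```python
from itertools import combinations
from collections import defaultdict

DIMS = ("time", "item", "location")

def map_side_cube(records):
    """Aggregate every cuboid inside the mapper, so only
    (cuboid-key -> partial sum) pairs are shuffled instead of raw
    rows: the ~100x -> ~20x shuffle reduction from the slide."""
    agg = defaultdict(float)
    cuboid_ids = [combo for k in range(len(DIMS) + 1)
                  for combo in combinations(DIMS, k)]
    for rec in records:
        for cuboid in cuboid_ids:
            key = (cuboid, tuple(rec[d] for d in cuboid))
            agg[key] += rec["price"]           # the measure: SUM(price)
    return agg

rows = [
    {"time": "9/15", "item": "milk", "location": "Urbana", "price": 3.0},
    {"time": "9/15", "item": "milk", "location": "Chicago", "price": 4.0},
]
partial = map_side_cube(rows)
# apex cuboid (grand total): partial[((), ())] == 7.0
```

The reducers then only merge partial sums per cuboid key, instead of re-aggregating raw rows.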
29. Pluggable storage engine
Common iterator interface for storage engines
Isolates the query engine from the underlying storage
Translates a cube query into an HBase table scan:
Columns, groups -> cuboid ID
Filters -> scan range (row key)
Aggregations -> measure columns (row values)
Scans the HBase table and translates the HBase result into a cube result:
HBase result (key + value) -> cube result (dimensions + measures)
How to Query the Cube?
Storage Engine
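The key-translation steps above can be sketched as follows. The 8-byte cuboid ID plus fixed-width dictionary codes is an illustrative layout, not Kylin's exact byte format, but it shows why a cuboid filter becomes a contiguous scan range:

```python
import struct

def encode_row_key(cuboid_id, dim_codes):
    """Row key = cuboid ID followed by fixed-width dictionary codes of
    the dimension values present in that cuboid (illustrative layout;
    Kylin's real encoding uses per-dimension widths from its dictionaries)."""
    key = struct.pack(">Q", cuboid_id)          # 8-byte big-endian cuboid ID
    for code in dim_codes:
        key += struct.pack(">I", code)          # 4-byte code per dimension
    return key

def scan_range_for_cuboid(cuboid_id):
    """All keys for a cuboid share its 8-byte prefix, so a query against
    that cuboid becomes one contiguous (start, stop) HBase scan range."""
    start = struct.pack(">Q", cuboid_id)
    stop = struct.pack(">Q", cuboid_id + 1)
    return start, stop

key = encode_row_key(cuboid_id=0b1101, dim_codes=[7, 42, 3])
start, stop = scan_range_for_cuboid(0b1101)
assert start <= key < stop
```

The row values hold the pre-computed measures, so reading them back is a lookup plus decode, not an aggregation.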
30. Compression and Encoding Support
Incremental Refresh of Cubes
Approximate Query Capability for Distinct Count (HyperLogLog)
Leverage HBase Coprocessors to Reduce Query Latency
Job Management and Monitoring
Easy Web Interface to Manage, Build, Monitor and Query Cubes
Security Capability to Set ACLs at Cube/Project Level
Support LDAP Integration
Feature Highlights…
31. Data Modeling
Cube metadata maps a source star schema to target HBase storage:
Source (star schema): a fact table joined to dimension tables
Cube: fact table, dimensions, measures, storage (HBase)
Target (HBase storage): row keys (row A, row B, row C) map to values (Val 1, Val 2, Val 3) in a column family
Roles: end user, cube modeler, admin
32. OLAP Cube – Balance between Space and Time
Cuboid lattice for a 4-dimension cube (time, item, location, supplier):
0-D (apex) cuboid: ()
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
33. Optimizing Cube Builds: Streaming
• Different ages of data have different analysis needs, resulting in different storage architectures
• An inverted index may keep data long enough to support row-level analysis
(Diagram: streaming timestamps are cubed into progressively coarser tiers)
Last hour: inverted index, in memory, row level
Last week: cube segments, in memory, minute granularity
Months: cube segments, on disk, hour granularity
34. Analytics Query Taxonomy
(Pyramid of query types, from most aggregated down to transaction level, split between OLAP and OLTP transaction/operation strategies)
High Level Aggregation • Very high level, e.g., GMV by site by vertical by week (OLAP)
Analysis Query • Middle level, e.g., GMV by site, by vertical, by category (level x) over the past 12 weeks (OLAP)
Drill Down to Detail • Detail level (summary table) (OLAP)
Low Level Aggregation • First-level aggregation (OLTP)
Transaction Level • Transaction data (OLTP)
Kylin is designed to accelerate the performance of 80+% of analytics queries on Hadoop
36. Huge data volume
Table scans
Big table joins
Data shuffling
Analysis at different granularities
Runtime aggregation is expensive
Map Reduce jobs
Batch processing
Technical Challenges
37. Big Data Era
More and more data becoming available on Hadoop
Limitations in existing Business Intelligence (BI) tools:
Limited support for Hadoop
Data size growing exponentially
High latency for interactive queries
Scale-up architecture
Challenges to adopting Hadoop as an interactive analysis system:
The majority of analyst groups are SQL savvy
No mature SQL interface on Hadoop
OLAP capability in the Hadoop ecosystem not ready yet
38. Streaming Cubes
• Cubes take time to build; what about real-time analysis?
• Sometimes we want to drill down to row-level information
• Streaming cubing:
– Build micro cube segments from the stream
– Use an inverted index to capture last-minute data
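The last-minute inverted index can be pictured as a map from (column, value) to a set of row ids, intersected across predicates at query time. A minimal in-memory sketch (the row format and API here are invented for illustration, not Kylin's actual streaming code):

```python
from collections import defaultdict

class InvertedIndex:
    """In-memory inverted index: (column, value) -> set of row ids,
    supporting row-level lookups over the freshest streaming data."""
    def __init__(self):
        self.rows = []
        self.index = defaultdict(set)

    def append(self, row):
        rid = len(self.rows)
        self.rows.append(row)
        for col, val in row.items():
            self.index[(col, val)].add(rid)
        return rid

    def lookup(self, **criteria):
        """Intersect posting lists, one per (column, value) predicate."""
        ids = None
        for col, val in criteria.items():
            posting = self.index.get((col, val), set())
            ids = set(posting) if ids is None else ids & posting
        return [self.rows[i] for i in sorted(ids or [])]

idx = InvertedIndex()
idx.append({"site": "US", "device": "mobile"})
idx.append({"site": "UK", "device": "mobile"})
hits = idx.lookup(device="mobile", site="US")
```

As the indexed rows age past the real-time window, they are rolled up into cube segments and the row-level detail is dropped.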
42. Kylin vs. Hive
#  Query Type              Return Dataset  Query on Kylin (s)  Query on Hive (s)  Comments
1  High Level Aggregation  4               0.129               157.437            1,217 times
2  Analysis Query          22,669          1.615               109.206            68 times
3  Drill Down to Detail    325,029         12.058              113.123            9 times
4  Drill Down to Detail    524,780         22.42               6383.21            278 times
5  Data Dump               972,002         49.054              N/A
(Bar chart: Hive vs. Kylin latency, 0-200 s, for SQL #1-#3 across the query taxonomy: high level aggregation, analysis query, drill down to detail, low level aggregation, transaction level)
Based on a 12+ billion record case
47. Kylin Ecosystem
Kylin Core: the fundamental framework of the Kylin OLAP engine
Extension: plugins supporting additional functions and features (security, Redis storage, Spark engine, Docker)
Interface: allows third-party users to build more features atop the Kylin core via user interfaces (web console, customized BI, Ambari/Hue plugin)
Integration: lifecycle management and support for integrating with other applications (ODBC driver, ETL, Drill, SparkSQL)
Driver: ODBC and JDBC drivers
48. Cubing Efficiency
MR is not the optimal framework
Spark cubing engine
Source from SparkSQL
Read data from SparkSQL instead of Hive
Route to SparkSQL
Unsupported queries are covered by SparkSQL
Adding Spark Support
49. Hive
Input source
Pre-joins the star schema during cube building
MapReduce
Pre-aggregates metrics during cube building
HDFS
Stores intermediate files during cube building
HBase
Stores the data cube
Serves queries on the data cube
A coprocessor is used for query processing
Calcite
Query parsing
Query execution
The DRY Principle
50. If you want to go fast, go alone.
If you want to go far, go together.
--African Proverb
http://kylin.io
Editor's Notes
Hello, my name is Seshu Adunuthula, and with me is Branky Shao. A quick intro about myself: I am responsible for the Analytics Infrastructure at eBay, which includes Hadoop, Teradata, ETL and BI. The talk today will be on Apache Kylin: why we built Kylin and the latest features we have added to it.
Agenda for the talk.
Some of the Stats on Big Data at eBay.
The rationale for building Apache Kylin
New capabilities we are introducing in 0.8 release, by integrating with Streaming
Branky will talk about the SEO Optimization Usecase with Streaming.
Some numbers to address the aspects of scale at eBay.
157M Active Buyers
25M Active Sellers: This is the key differentiation from our competitor up North. A lot of what we do with Analytics is centered around making the Seller successful.
A key to that is the ability to analyze and put structure around the 800M listings we have on eBay, with more than 8.8M listings being added every week.
As with everyone else, more than 40% of eBay revenue is generated on mobile devices, and growing.
Data we capture can be classified into behavioral data (clickstream analysis of user behavior on the eBay web site) and transactional data (an actual sale made, a listing added, or an update in an auction). We retain up to 9 quarters of this data for seasonal analysis. The applications that get built on this data are what you would expect:
Personalization: personalize the site based on your historical browsing behavior
Targeting via email and mobile
Deals
Improving our search capabilities
and SEO.
Evolution of the Hadoop infrastructure: needs driving innovation in the infrastructure.
Hadoop started out as a Single use system for Batch Apps only where the compute engine MR did both data processing and cluster resource management.
First generation Hadoop met the need for affordable scalability and flexible data structure, and provided the democratization needed on the big data investment to give companies big and small the power and affordability of big data. However, it was only the first step. Limitations, like its batch-oriented job processing and consolidated resource management, drove the development of Yet Another Resource Negotiator (YARN), which was needed to take Hadoop to the next level.
Summary: 1st Gen Hadoop was a monolithic system. If you compare it to, say, the Windows O/S, imagine the O/S and applications such as Notepad being tied together. Also, there was only one single app, i.e., MapReduce. This was limiting and not flexible.
In its second major release, Hadoop took a giant leap forward, from batch oriented to enabling interactive capabilities. It also included the strategic decision to support the innovation of more independent execution models, including MapReduce as a YARN application separate from the data persistence of HDFS. Thus, via YARN, Hadoop became a true data operating system and platform for YARN applications, and the newer, improved two-tier framework enabled parallel innovation of data engines (both as Apache projects and in vendor-proprietary products). (Data lake strategy)
Now, data from transaction oriented databases, document databases, and graph databases can all be stored within Hadoop, and also accessed via YARN applications without duplicating or moving data for a different workload usage. While most major applications have not yet begun to migrate to Hadoop or NoSQL data stores, this trend is expected to grow as the benefits of having a unified data repository capable of data reuse without duplication or latency are realized.
The Hadoop ecosystem enables innovation in Hadoop engines and YARN applications that will allow some to increase capabilities and performance (like Apache Hive) while other engines continue to mature and innovate (like Apache Drill). As different YARN applications for SQL access continue to improve, a critical factor will be those optimized for transaction processing with both record read and write operations with transaction integrity and atomicity, consistency, isolation, and durability (or, ACID) capability. This will enable current SQL-based applications to migrate to the Hadoop platform. And, as these migrations began to happen, Hadoop will enable new benefits to be realized from having the data already in the data operating system for other applications to leverage, rather than deciding whether it’s cost effective and timely to move data.
What is Kylin
We derived the project name from a mythical animal in a Chinese legend, one of a set of four divine creatures that also includes a phoenix, a dragon and a turtle. One other reason we chose it: it is supposed to live for 2,000 years; if Apache Kylin gets 1/100 of its life we will be happy…
So what is Kylin? It is an OLAP engine built for extreme scale on big data technology, including Hive and HBase.
We have been working on Kylin for more than a year; we open sourced it in October and it was accepted into Apache Incubation in November.
So why did we build Kylin?
Like any enterprise, we internally deployed commercial OLAP engines such as MicroStrategy, but soon started hitting limits on scalability.
We were collecting behavioral data at the rate of 20 TB a day; a monthly analysis required petabyte-scale cubes.
There were no products available, commercially or in open source, that supported cubes at this scale…
We set out to build Kylin with the following requirements.
Interactive Query capability on billions of rows of data.
Full ANSI SQL Compliance
OLAP style cubes
Tight integration with BI tools like Tableau and Microstrategy
Also support for large number of dimensions and large cardinality for each dimension
Highly Available & Secure
Fundamentally different in that it is built from the ground up to be distributed, with support for scale-out.
Since we open sourced last November, we have seen rapid adoption in the industry: more than 25 contributors, an active dev mailing list, in production outside of eBay at Baidu Maps and a couple of other companies, and in evaluation at a variety of locations. There has been interest in a support model around this; unfortunately we are not there yet. It is still DIY if you want to deploy it.
Within eBay some of the biggest use cases are the Traffic Data 21 TB cube and User Session data with 75 Billion rows and 14 TB cubes.
A High Level Architecture for Kylin which is a Standard MOLAP Architecture built on Hadoop.
Data sources to build your MOLAP cubes, primarily Hive. We have a fantastic project in the works for a storage abstraction layer to support other NoSQL stores such as Cassandra/CouchBase.
An Engine Abstraction which maintains the Cube Metadata and a Cube Builder. Today a set of Map Reduce Jobs to build the cubes.
A storage layer to store the cubes in HBase, primarily through a bulk load of the aggregates into HBase.
We are looking for active community participation to build out additional Data Source, Engine and Storage plugins into Kylin.
A Query Engine that directly index into the multi-dimensional arrays built into Hbase.
Drilling a bit into the Storage Layer
- You have a set of measures on which you want to compute your aggregates, and the dimensions you want to aggregate on. The HBase row key is a combination of a cuboid ID and a set of dimension IDs. The row value holds the aggregates for all the measures.
What gets stored in Hbase is a lattice of Cuboids.
In data warehousing literature, a base cuboid stores the base measures and the apex cuboid captures the highest level of summarization. A data cube is the lattice of all cuboids between the base and the apex. For example, L1 has the base measure, L2 here has the aggregation for supplier, L3 has supplier and time, and so on.
Each of these values are stored in Hbase using the Row key format we had talked about earlier.
Once we have stored the multi-dimensional array in HBase, it needs to be queried. We built the query engine based on Calcite. Calcite has several extension points that we implement against: 1. the Metadata SPI; 2. pluggable rules in the query optimizer to efficiently execute SQL queries and integrate with the storage APIs.
A few more details, will skip because of time concerns, but this is the core of the Kylin engine, if you are interested in details please follow up on the Kylin Dev List.
All of this is good, but the efficiency of any MOLAP solution rests in its ability to efficiently build the cubes. As the number and cardinality of the dimensions increase, you quickly run into a combinatorial explosion that will render the MOLAP solution quite useless.
The next few slides we will walk thru a series of ways to make the Cube building process as efficient as possible.
Step 1 is to avoid dimension explosion by dividing your dimensions into aggregation groups. A lot of the time you notice that you aggregate time and location, or time and category, but not all three together. Creating separate aggregation groups helps you build your cubes faster and take up less space, at the expense of query-time latency.
The unaggregated results are now computed at query time.
In Step 2, based on the cardinality of the dimensions, the system automatically prunes some of the cuboids in the lattice, determining which cuboids can be dropped.
Another approach to optimizing cube build times is the incremental building.
The initial cube creation is done with the all available measure values.
Subsequent cubes built on daily refreshes of the data
A query spans the each of the cube segments that constitute the cube
Periodically cube segments are merged together to improve query performance
The existing algorithm is a series of Map Reduce Jobs building different levels in the lattice of Cuboids.
Sequencing the different MR jobs increases the delay in building the cubes
The Aggregations were done on the Reduce side resulting in 100X the total cube size
Obviously this was an area ripe for improvement.
With map-side aggregation, the amount of data shuffled is reduced from 100x to 20x. A single MR job now computes all the different cuboids in the lattice. This also paves the way for streaming support, or more accurately, building cubes on micro-batches as they arrive.
With the addition of Streaming capabilities
Kylin now acts as a Kafka Consumer
Subscribes to Kafka Topics
Messages from the topic are kept in memory and an inverted index is maintained on the rows as they arrive.
The Kylin Query Engine is extended to aggregate results from the in memory tables
As the data ages cube segments are built and combined with other cube segments.
Measure eBay pages' SEO performance: how free traffic from major search engines contributes to eBay's business.
10 Billion Raw Events
5 Million Sessions
10% of the traffic is generated through SEO
We are able to drill down on the different dimensions Site/Country/Search Engine/Page/Device
Some additional details
For example, here we have a measure with dimensions Year, Country and Category; the HBase rows store the Cartesian product of the aggregates.