Wei Li of Alibaba
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, RethinkDB - Eric David Benari, PMP
Advancing Real-Time Responses in Web Applications
Michael Glukhovsky, Co-Founder, RethinkDB
Video of this session at the Database Camp conference at the UN is on http://www.Database.Camp
eBay has one of the largest Hadoop clusters in the industry with many petabytes of data. This talk will give an overview of how Hadoop and HBase have been used within eBay, the lessons we have learned from supporting large-scale production clusters, as well as how we plan to use and improve Hadoop and HBase moving forward. Specific use cases, production issues and platform improvement work will be discussed.
The document discusses and compares popular big data tools Hadoop, Spark, and PySpark. It provides overviews of the pros and cons of Hadoop and Spark, describing Hadoop as providing linear scalability on commodity hardware for distributed processing of large data sets, while Spark is 100x faster using in-memory computation. It also summarizes Resilient Distributed Datasets (RDDs), Spark's programming model, and how Spark and PySpark can be used with DataFrames and SQL for structured data processing.
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ... - Spark Summit
In this presentation, we are going to talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have been able to successfully use multiple micro-batch Spark Streaming pipelines to update and process information like product availability, pick up today, etc., along with updating our product catalog information in our search index, at up to 10,000 Kafka events per second in near real-time. Earlier, all the product catalog changes in the index had a 24-hour delay; using Spark Streaming we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving end-customers instant access to features like availability of a product, store pick up, etc.
Second, we have built a scalable anomaly detection framework purely using Spark DataFrames that is being used by our data pipelines to detect abnormalities in search data. Anomaly detection is an important problem not only in the search domain but also in many other domains such as performance monitoring, fraud detection, etc. During this, we realized that Spark DataFrames are not only able to process information faster but are also more flexible to work with. One could write Hive-like queries, Pig-like code, UDFs, UDAFs, Python-like code, etc., all in the same place very easily, and can build DataFrame templates which can be used and reused by multiple teams effectively. We believe that if implemented correctly, Spark DataFrames can potentially replace Hive/Pig in the big data space and have the potential of becoming a unified data language.
We conclude that Spark Streaming and DataFrames are the key to processing extremely large streams of data in real time with ease of use.
Building tiered data stores using Aesop to bridge SQL and NoSQL systems - Regunath B
Slides from my talk on building tiered data stores using Aesop to bridge SQL and NoSQL data stores. Aesop is a pub-sub like change data capture and propagation system.
1) Aesop is an open source change data capture and propagation tool that reliably captures changes from data sources and propagates them to other data stores and systems to enable eventual consistency across polyglot data platforms.
2) It uses log mining to capture changes from data sources like MySQL and propagates the change events to consumers like ElasticSearch and HBase through an enhanced relay component.
3) It provides utilities for bootstrapping consumers, monitoring and administering the system, and has been used in production at Flipkart for applications including payments, ETL, and data serving.
This is an exam cheat sheet that hopes to cover all key points for the GCP Data Engineer Certification Exam.
Let me know if there is any mistake and I will try to update it.
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre... - Data Con LA
The team at Fandango heartily embraced NoSQL, using Couchbase to power a key media publishing system. The initial implementation was fraught with integration issues and high latency, and required a major effort to successfully refactor. My talk will outline the key organizational and architectural decisions that created deep systemic problems, and the steps taken to re-architect the system to achieve a high level of performance at scale.
Big Data Challenges and How to Overcome Them with Qubole - a Self-Service Platform for Big Data Analytics built on Amazon Web Services, Microsoft and Google Clouds. Storing, accessing, and analyzing large amounts of data from diverse sources and making it easily accessible to deliver actionable insights can be challenging for data-driven organizations. The solution for customers is to optimize scaling and create a unified interface to simplify analysis. Qubole helps customers simplify their big data analytics with speed and scalability, while providing data analysts and scientists self-service access in the cloud. The platform is fully elastic and automatically scales or contracts clusters based on workload. We will give an overview of the main features, advantages, and drawbacks of this platform.
A Non-Standard Use Case of Hadoop: High Scale Image Processing and Analytics - DataWorks Summit
1. The Hadoop Image Processing (HIP) pipeline acquires vehicle images, identifies updates, generates URLs, crops and resizes images, copies them to asset servers, and removes duplicates.
2. It uses HBase for image storage and archiving, MapReduce for image processing, Kafka for publishing to asset servers, OpenCV for image processing, and Avro for data serialization.
3. Performance testing showed HIP scales linearly and is at least 10x faster than the previous system, and using cascading downloads provided a 20% performance gain.
Introduction to Kudu - StampedeCon 2016 - StampedeCon
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
This document discusses migrating data from MySQL to Amazon Redshift. It describes MySQL and Redshift, and some of the challenges of migrating between the two systems, such as incompatible schemas and manual processes. The proposed solution is to use a cloud data lake with schema-on-read to store JSON event data, which can then be loaded into Redshift, a cloud data warehouse with schema-on-write, providing an automated way to migrate data between different systems and schemas.
Webinar: Buckle Up: The Future of the Distributed Database is Here - DataStax... - DataStax
We recently launched DataStax Enterprise 4.5 - the fastest, most scalable distributed database technology with blazing performance, 100x faster analytics and automated diagnostics.
Join DataStax’s product gurus Martin Van Ryswyk, EVP of Engineering, and Robin Schumacher, VP of Products, in an open dialog as they discuss the importance of -
- Selecting the right database technology for today’s digital world
- Integrated analytics for lightning fast customer interactions
- Merging operational and historical data for the most accurate insights possible
Today's organizations contend with more diverse applications, data, and systems than ever before – silos that are often fragmented and difficult to leverage together. iWay Big Data Integrator (BDI) simplifies the creation, management, and use of Hadoop-based data lakes. It provides a modern, native approach to Hadoop-based data integration and management that ensures high levels of capability, compatibility, and flexibility to help your organization.
Join us to learn how you can simplify adoption of Apache Hadoop using iWay Big Data Integrator. Learn about our ability to streamline the deployment of ingestion, transformation, and extraction tasks.
See the pre-recorded webcast online at: http://www.informationbuilders.com/webevents/online/24427#sthash.J0cRy1PG.dpuf
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H... - DataStax
Big data doesn't mean big money. In fact, choosing a NoSQL solution will almost certainly save your business money, in terms of hardware, licensing, and total cost of ownership. What's more, choosing the correct technology for your use case will almost certainly increase your top line as well.
Big words, right? We'll back them up with customer case studies and lots of details.
This webinar will give you the basics for growing your business in a profitable way. What's the use of growing your top line but outspending any gains on cumbersome, ineffective, outdated IT? We'll take you through the specific use cases and business models that are the best fit for NoSQL solutions.
By the way, no prior knowledge is required. If you don't even know what RDBMS or NoSQL stand for, you are in the right place. Get your questions answered, and get your business on the right track to meeting your customers' needs in today's data environment.
AWS Big Data Demystified #1: Big data architecture lessons learned - Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of the big data technologies that were selected or disregarded at our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
The Facebook group: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
There is a lot more to Hadoop than MapReduce. An increasing number of engineers and researchers involved in processing and analyzing large amounts of data regard Hadoop as an ever-expanding ecosystem of open-source libraries, including NoSQL, scripting, and analytics tools.
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis... - Spark Summit
Redis accelerates Apache Spark execution by 45 times when used as a shared distributed in-memory datastore for Spark in analyses like time-series data range queries. With the Redis module for machine learning, redis-ml, the implementation of spark-ml models gains a new real-time serving layer that offloads processing of models directly into Redis, allows multiple applications to reuse the same models, and speeds up classification and execution of these models by 13x. Join this session to learn more about Redis Labs' connector for Apache Spark that enhances production implementations of real-time big data processing.
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S... - Spark Summit
The document discusses powering predictive mapping at scale using the SMACK stack, which includes Spark, Kafka, and Elasticsearch. It describes how the SMACK stack can ingest millions of events per second from connected devices, store the data in Apache Spark, and allow real-time and batch processing of the data. It also provides an example of using the stack for real-time tracking of geo-enabled IoT devices and demonstrates the data flow and a demo of the system.
Developing high frequency indicators using real time tick data on apache supe... - Zekeriya Besiroglu
This document summarizes the Central Bank of Turkey's project to develop high frequency market indicators using real-time tick data from the Thomson Reuters Enterprise Platform. It describes how they set up Apache Kafka, Druid, Spark and Superset on Hadoop to ingest, store, analyze and visualize the data. Their goal was to observe foreign exchange markets in real-time to detect risks and patterns. The architecture evolved over three phases from an initial test cluster to integrating Druid and Hive for improved querying and scaling to production. Work is ongoing to implement additional indicators and integrate historical data for enhanced analysis.
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ... - Data Con LA
NoSQL has exploded on the developer scene, promising alternatives to RDBMS that make rapidly developing Internet-scale applications easier than ever. However, as a trade-off for the ease of development and scale, some of the familiarity of well-known query interfaces such as SQL has been lost. Until now, that is... N1QL (pronounced 'nickel') is a SQL-like query language for querying JSON, which brings the familiarity of RDBMS back to the NoSQL world. In this session you will learn about the syntax and basics of this new language as well as integration with the Couchbase SDKs.
This document summarizes digital transformation with Microsoft Azure, including cloud computing, big data, and data lakes. It discusses data lake characteristics such as structured, semi-structured, and unstructured data. Data lakes are used for reporting, visualization, analytics, and machine learning. They provide a single store for raw and processed data ranging from raw copies of source systems to structured data for analytics. The document also briefly mentions Azure Data Lake Analytics, DataBricks, and concludes by thanking the reader.
This document discusses Presto, an interactive SQL query engine for big data. It describes how Presto is optimized to quickly query data stored in Parquet format at Uber. Key optimizations for Parquet include nested column pruning, columnar reads, predicate pushdown, dictionary pushdown, and lazy reads. Benchmark results show these optimizations improve Presto query performance. The document also provides an overview of Uber's analytics infrastructure, applications of Presto, and ongoing work to further optimize Presto and Hadoop.
1) The presentation discusses Druid, an open source analytics engine that can perform aggregations on memory mapped data in sub-second time.
2) It describes how Druid fits into their software stack at the API layer and how they extend its capabilities through a SQL interface and addressing limitations like limited querying and missing features like distinct counts.
3) Examples of SQL queries against Druid are shown to demonstrate its capabilities like group by, filtering, joins, and handling of timeseries data.
Presto is an open source distributed SQL query engine that allows querying of data across different data sources. It was originally developed by Facebook and is now used by many companies. Presto uses connectors to query various data sources like HDFS, S3, Cassandra, MySQL, etc. through a single SQL interface. Companies like Facebook and Teradata use Presto in production environments to query large datasets across different data platforms.
Operationalizing Big Data Pipelines At Scale - Databricks
Running a global, world-class business with data-driven decision making requires ingesting and processing diverse sets of data at tremendous scale. How does a company achieve this while ensuring quality and honoring their commitment as responsible stewards of data? This session will detail how Starbucks has embraced big data, building robust, high-quality pipelines for faster insights to drive world-class customer experiences.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
xPatterns is a big data analytics platform-as-a-service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets, and a management console for building an ELT pipeline with data monitoring and quality gates; a data warehouse for ad-hoc and scheduled querying, analysis, model building, and experimentation; tools for exporting data to NoSQL and SolrCloud clusters for real-time access through low-latency/high-throughput APIs; and dashboard and visualization APIs/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications built on top of xPatterns for our largest customer, which runs xPatterns in production on top of a data warehouse of several hundred TB of medical, pharmacy, and lab records comprising tens of billions of rows. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS stack of Spark, Shark, Mesos, and Tachyon, with lessons learned and demos.
Overview of Apache Trafodion (incubating), Enterprise Class Transactional SQL-on-Hadoop DBMS, with operational use cases, what it takes to be a world class RDBMS, some performance information, and the new company Esgyn which will leverage Apache Trafodion for operational solutions.
Architectural Evolution Starting from Hadoop - SpagoWorld
Speech given by Monica Franceschini, Solution Architecture Manager at the Big Data Competency Center of Engineering Group, on the occasion of the Data Driven Innovation Rome 2016 - Open Summit.
HBaseConAsia2018 Track3-2: HBase at China Telecom - Michael Stack
HBase is used at China Telecom for various applications including persistence for streaming jobs, online reading and writing, and as a data store for their core system. They operate several HBase clusters storing over 500 TB of data and ingesting 1 TB per day. They monitor HBase using Ganglia for basic metrics and Zabbix for critical alerts. When issues arise, such as a system hang, they investigate debug cases and apply optimizations such as changing the garbage collector from CMS to G1 and implementing read/write splitting.
Horizon is a distributed SQL database that allows users to query and analyze big data stored in HBase using a familiar SQL interface. It uses the H2 database engine and customizes HBase's data model to provide features like indexing, partitioning, and SQL support. Horizon aims to make big data more accessible while maintaining HBase's scalability. It will integrate with Hadoop ecosystems and provide high performance data loading, scanning, and analysis tools. Horizon's architecture distributes the SQL engine across servers and uses HBase as the distributed storage layer.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud - Gluent
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
In the end, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
Big Data Analytics with Hadoop, MongoDB and SQL Server - Mark Kromer
This document discusses SQL Server and big data analytics projects in the real world. It covers the big data technology landscape, big data analytics, and three big data analytics scenarios using different technologies like Hadoop, MongoDB, and SQL Server. It also discusses SQL Server's role in the big data world and how to get data into Hadoop for analysis.
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for the storage and analysis of datasets that are too large for single servers. The document discusses several key Hadoop components including HDFS for storage, MapReduce for processing, HBase for column-oriented storage, Hive for SQL-like queries, Pig for data flows, and Sqoop for data transfer between Hadoop and relational databases. It provides examples of how each component can be used and notes that Hadoop is well-suited for large-scale batch processing of data.
Azure Cafe Marketplace with Hortonworks, March 31 2016 - Joan Novino
Azure Big Data: “Got Data? Go Modern and Monetize”.
In this session you will learn how Hortonworks Data Platform (HDP), architected, developed, and built completely in the open, provides an enterprise-ready data platform for adopting a Modern Data Architecture.
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud - Michael Stack
New Journey of HBase in Alibaba and Cloud discusses Alibaba's use of HBase over 8 years and improvements made. Key points discussed include:
- Alibaba began using HBase in 2010 and has since contributed to the open source community while developing internal improvements.
- Challenges addressed include JVM garbage collection pauses, separating computing and storage, and adding cold/hot data tiering. A diagnostic system was also created.
- Alibaba uses HBase across many core scenarios and has integrated it with other databases in a multi-model approach to support different workloads.
- Benefits of running HBase on cloud include flexibility, cost savings, and making it
The session covers how to get started building big data solutions in Azure. Azure provides different Hadoop clusters for the Hadoop ecosystem. The session covers the basics of HDInsight clusters, including Apache Hadoop, HBase, Storm, and Spark. It covers how to integrate with HDInsight in .NET using different Hadoop integration frameworks and libraries. The session is a jump start for engineers and DBAs with RDBMS experience who are looking to start working with and developing Hadoop solutions. It is demo driven and covers the basics of Hadoop open source products.
2015 nov 27_thug_paytm_rt_ingest_brief_final - Adam Muise
The document discusses Paytm Labs' transition from batch data ingestion to real-time data ingestion using Apache Kafka and Confluent. It outlines their current batch-driven pipeline and some of its limitations. Their new approach, called DFAI (Direct-From-App-Ingest), will have applications directly write data to Kafka using provided SDKs. This data will then be streamed and aggregated in real-time using their Fabrica framework to generate views for different use cases. The benefits of real-time ingestion include having fresher data available and a more flexible schema.
The presentation covers how to get started building big data solutions in Azure. Azure provides different Hadoop clusters for the Hadoop ecosystem. The session covers the basics of HDInsight clusters, including Apache Hadoop, HBase, Storm, and Spark. It covers how to integrate with HDInsight in .NET using different Hadoop integration frameworks and libraries. The session is a jump start for engineers and DBAs with RDBMS experience who are looking to start working with and developing Hadoop solutions. It is demo driven and covers the basics of Hadoop open source products.
Generic presentation about Big Data Architecture/Components. This presentation was delivered by David Pilato and Tugdual Grall during JUG Summer Camp 2015 in La Rochelle, France
Hortonworks Data Platform 2.2 includes Apache HBase for fast NoSQL data access. In this 30-minute webinar, we discussed HBase innovations that are included in HDP 2.2, including: support for Apache Slider; Apache HBase high availability (HA); block cache compression; and wire-level encryption.
Microsoft Data Platform - What's Included - James Serra
This document provides an overview of a speaker and their upcoming presentation on Microsoft's data platform. The speaker is a 30-year IT veteran who has worked in various roles including BI architect, developer, and consultant. Their presentation will cover collecting and managing data, transforming and analyzing data, and visualizing and making decisions from data. It will also discuss Microsoft's various product offerings for data warehousing and big data solutions.
Big data represents a real challenge that is at once technical, business, and societal: exploiting massive data opens up possibilities for radical transformation at the level of companies and usage. At least, on the condition that we are technically capable of it... because acquiring, storing, and exploiting massive quantities of data pose real technical challenges.
A big data architecture enables the creation and administration of all the technical systems that allow the data to be exploited properly.
A huge number of different tools exist for manipulating massive quantities of data: for storage, analysis, or distribution, for example. But how do you assemble these different tools into an architecture that can scale, tolerate failures, and be easily extended, all without letting the costs explode?
The success of a big data system depends on its architecture, on a sound infrastructure, and on the use that is made of it: "Data into Information into Value".
A big data architecture is composed of four main parts: integration, data processing & storage, security, and operations.
SnappyData is a new open source project started by Pivotal GemFire founders to provide a unified platform for OLTP, OLAP and streaming analytics using Spark. It aims to simplify big data architectures by supporting mixed workloads in a single clustered database, allowing for real-time operational analytics on live data without copying to other systems. This provides faster insights than current approaches that require periodic data copying between different databases and analytics systems.
Similar to hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloud - Michael Stack
Long Chen
Track 3: Applications
hbaseconasia2019 Recent work on HBase at Pinterest - Michael Stack
Lianghong Xu
Track 3: Applications
hbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltd - Michael Stack
Yechao Chen
Track 3: Applications
TianHang Tang
Track 3: Applications
hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev... - Michael Stack
Xu Ming
Track 3: Applications
Andrew Cheng
Track 3: Applications
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and... - Michael Stack
Fei Xiao of Alibaba
Track 2: Ecology and Solutions
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ... - Michael Stack
Huan-Ping Su (蘇桓平) and Yi-Sheng Lien (連奕盛) of National Cheng Kung University
Track 2: Ecology and Solutions
hbaseconasia2019 Pharos as a Pluggable Secondary Index Component - Michael Stack
Lei Wang of China Everbright Bank
Track 2: Ecology and Solutions
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba - Michael Stack
Yun Zhang
Track 2: Ecology and Solutions
Junhong Xu of Xiaomi
Track 2: Ecology and Solutions
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase - Michael Stack
Pradeep S, Mallikarjun V of Flipkart
Track 1: Internals
hbaseconasia2019 Distributed Bitmap Index Solution - Michael Stack
Xingjun Hao of Huawei
Track 1: Internals
hbaseconasia2019 HBase Bucket Cache on Persistent Memory - Michael Stack
Anoop Sam John, Ramkrishna S Vasudevan, and Xu Kai of Intel
Track 1: Internals
hbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACL - Michael Stack
Mei Yi of Xiaomi
Track 1: Internals
hbaseconasia2019 BDS: A data synchronization platform for HBase - Michael Stack
熊嘉男
Track 1: Internals
hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in... - Michael Stack
Anoop Sam John of Intel and Zheng Hu of Alibaba
Track 1: Internals
hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB... - Michael Stack
The document discusses HBCK2, a tool for fixing issues in HBase 2. Some key points:
1. HBCK2 is simpler than HBCK1, with fewer fix commands and no diagnosis commands. It requires a deeper understanding of HBase internals.
2. HBCK2 commands are master-oriented and fix issues one at a time. Common issues include regions not online, stuck procedures, and tables in the wrong state.
3. Recipes are provided to fix specific issues like missing meta regions or regions in transition using HBCK2 commands like assigns and bypass.
4. HBCK2 is still a work in progress, but contributions are welcome.
Keynote given by Duo Zhang of Xiaomi and Chunhui Shen of Alibaba
Track 1: Internals
HBaseConAsia2018 Track3-1: Serving billions of queries in millisecond latencies - Michael Stack
This document discusses how Bloomberg uses HBase to serve billions of queries with millisecond latency. It covers HBase principles like being an ordered key-value store and providing ACID transactions. It also discusses modeling data for HBase, including dealing with data and query skew. Implementation details covered include caching, block size tuning, column families, and compaction. The overall goal is to optimize HBase for Bloomberg's low-latency data storage and retrieval needs.
9. Spark analysis HBase Data
[Architecture diagram: Spark reads HBase through three paths: the native Get/Scan API (multi-get and range scans with filters, exposed to Spark via newAPIHadoopRDD), the Phoenix DataSource API (PhoenixInputFormat, with filter pushdown), and snapshot reads over regions (TableSnapshotInputFormat); writes go through the put/create API.]
Spark on HBase
[Diagram: the Spark-on-HBase connector parses the query (Spark Parser), computes input partitions (GetPartition), prunes to the required columns, pushes down filters (PrunedFilter), and maps the HBase schema onto the Spark schema (Schema Mapping).]
Performance
• Distributed scans
• SQL optimizations such as partition pruning, column pruning, and predicate pushdown
• Direct reading of HFiles
• Automatic transformation to column-based storage
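To make the read path concrete, here is a minimal sketch of querying an HBase table from Spark through the Phoenix DataSource API, assuming the open-source phoenix-spark connector is on the classpath; the table name, ZooKeeper quorum, and columns are hypothetical. Column pruning and predicate pushdown happen when the connector plans the scan, matching the optimizations listed above.

```scala
import org.apache.spark.sql.SparkSession

object SparkOnHBaseRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-on-hbase-read")
      .getOrCreate()

    // Load an HBase table through the Phoenix DataSource API.
    // Table name and ZooKeeper quorum are placeholders.
    val events = spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "USER_EVENTS")
      .option("zkUrl", "zk1,zk2,zk3:2181")
      .load()

    // Only the selected columns are read (column pruning) and the
    // filter is pushed down into the HBase/Phoenix scan (predicate
    // pushdown), so the region servers do the heavy lifting.
    events
      .select("USER_ID", "EVENT_TYPE", "EVENT_TS")
      .filter("EVENT_TYPE = 'click' AND EVENT_TS >= 1563580800000")
      .groupBy("USER_ID")
      .count()
      .show()

    spark.stop()
  }
}
```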
16. HBase X-Pack: Product recommendation platform
Scenario: as the app accumulates more and more users, the customer is ready to launch a product recommendation feature, which requires real-time ETL, analysis, storage, and model computation over the user behavior logs; a model-training sketch follows.
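As a sketch of the model-computation step, the job below trains a collaborative-filtering model with Spark MLlib's ALS over behavior logs already archived to the warehouse; the input path, schema, and hyperparameters are assumptions for illustration, not details from the deck.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object TrainRecommendationModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reco-als").getOrCreate()

    // Behavior logs archived in the offline warehouse (hypothetical
    // path; schema assumed as user_id: Int, item_id: Int, rating: Float).
    val ratings = spark.read.parquet("hdfs:///warehouse/behavior_ratings")

    val als = new ALS()
      .setUserCol("user_id")
      .setItemCol("item_id")
      .setRatingCol("rating")
      .setRank(32)
      .setMaxIter(10)
    val model = als.fit(ratings)

    // Top-10 items per user; these results would then be written back
    // to HBase (e.g. via bulkload) for online serving.
    model.recommendForAllUsers(10)
      .write.mode("overwrite")
      .parquet("hdfs:///warehouse/reco_topk")

    spark.stop()
  }
}
```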
17. HBase X-Pack: Integrated data processing platform
Pain points
• Online HBase and offline analysis sharing one cluster degrades online HBase query performance
• Spark and Hive SQL reading and writing HBase directly in bulk affects HBase stability
Values
• HBase data is asynchronously archived to the Spark data warehouse, with no effect on the online workload
• After Spark analysis, the result data is transferred back by bulkload (sketched below), which does not affect the online business
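The bulkload path can be sketched as follows: the Spark job writes its result rows out as HFiles and hands them directly to the region servers, bypassing the normal write path and the WAL. This is a minimal sketch against the classic HFileOutputFormat2 / LoadIncrementalHFiles API; the table name, column family, and output path are placeholders.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
// HBase 1.x packages; in HBase 2.x LoadIncrementalHFiles lives in
// org.apache.hadoop.hbase.tool.
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object BulkloadResults {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bulkload-results").getOrCreate()
    val conf  = HBaseConfiguration.create()
    val table = TableName.valueOf("RECO_RESULT") // placeholder table name

    // Result rows from the Spark analysis job: (rowkey, value) pairs.
    val results = spark.sparkContext.parallelize(
      Seq(("user1", "itemA"), ("user2", "itemB")))

    // HFiles must be written in sorted rowkey order. Production jobs
    // also call HFileOutputFormat2.configureIncrementalLoad so output
    // files line up with region boundaries.
    val kvs = results.sortByKey().map { case (rowkey, value) =>
      val kv = new KeyValue(Bytes.toBytes(rowkey), Bytes.toBytes("cf"),
        Bytes.toBytes("item"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(rowkey)), kv)
    }
    val out = "/tmp/reco_hfiles"
    kvs.saveAsNewAPIHadoopFile(out, classOf[ImmutableBytesWritable],
      classOf[KeyValue], classOf[HFileOutputFormat2], conf)

    // Hand the finished HFiles to the region servers: no write path,
    // no WAL, so the online workload is undisturbed.
    val conn = ConnectionFactory.createConnection(conf)
    try {
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(out),
        conn.getAdmin, conn.getTable(table), conn.getRegionLocator(table))
    } finally conn.close()
    spark.stop()
  }
}
```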
18. HBase X-Pack: Big data risk control platform
[Architecture diagram: Kafka feeds Spark Streaming; daily increments are archived to a Parquet offline warehouse on HDFS or OSS; Spark SQL and MLlib train models over the full dataset; model data is loaded into HBase to serve online risk-control decisions.]
• Real-time messages: Kafka receives the messages collected in real time, and Spark Streaming can perform simple ETL on the stream (see the sketch after this list)
• Daily incremental archiving: the data streamed to the storage service each day is archived to the Spark offline warehouse
• Offline data warehouse: stores the full dataset, kept in columnar format on HDFS
• Full model training: Spark supports complex computation, and MLlib and Python are well suited to training models over the data
• Model data load: the newly trained model is loaded into the model service, which serves external risk-control decisions
• Risk-control simulation: when the training center adds a new rule or a new model, its quality can be verified by training against the full dataset in the Spark offline warehouse
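A minimal sketch of the first two bullets, written with Spark Structured Streaming (the deck says Spark Streaming; the DStream API would look similar): consume events from Kafka, apply light ETL, and append the increment to the Parquet warehouse. Brokers, topic, schema, and paths are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ArchiveEventStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("risk-event-archive").getOrCreate()
    import spark.implicits._

    // Consume raw risk events from Kafka (brokers/topic are placeholders).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("subscribe", "user-events")
      .load()

    // Simple ETL: decode the payload and keep the fields the models need.
    val events = raw
      .selectExpr("CAST(value AS STRING) AS line")
      .select(split($"line", ",").as("f"))
      .select(
        $"f".getItem(0).as("user_id"),
        $"f".getItem(1).as("action"),
        $"f".getItem(2).cast("long").as("event_ts"))

    // Append the increment to the Parquet offline warehouse; a daily
    // job would fold these files into the full dataset.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///warehouse/ods/events") // placeholder paths
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
      .awaitTermination()
  }
}
```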
19. HBase X-Pack: Game log processing platform
https://yq.aliyun.com/articles/702337?spm=a2c4e.11163080.searchblog.27.154c2ec1x4glPb
Values
• Supports high-performance offline computing and real-time computing
• Manages data job scheduling
• Supports elastic scaling of compute (cost savings)
• Supports hot and cold storage tiers (cost savings)
• Fits data lake scenarios, supporting high-throughput mass storage of structured and unstructured data
20. HBase X-Pack: Real-time scenarios
Values
• Pre-computation generates a common indicator layer, and HBase's and Solr's real-time analysis and processing capabilities serve the real-time report calculations of different businesses
• Pre-computation uses Spark Streaming, with latency under 10 s
• Spark Streaming can be combined with HBase for deduplication and dimension-table joins (sketched below)
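The deduplication and dimension-table pattern in the last bullet can be sketched as follows: within each micro-batch, an atomic check-and-put against an HBase "seen" table drops duplicates, and a point Get against a dimension table enriches the event. Table, family, and field names are invented for illustration.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.DataFrame

object DedupAndDimJoin {
  // Called per micro-batch (e.g. from Structured Streaming's foreachBatch).
  def process(batch: DataFrame): Unit = {
    batch.rdd.foreachPartition { rows =>
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val dedup = conn.getTable(TableName.valueOf("EVENT_DEDUP")) // placeholder
      val dim   = conn.getTable(TableName.valueOf("USER_DIM"))    // placeholder
      try {
        rows.foreach { row =>
          val eventId = row.getAs[String]("event_id")
          val userId  = row.getAs[String]("user_id")
          // Atomic "put if absent" marks the event as seen exactly once.
          val put = new Put(Bytes.toBytes(eventId))
          put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("seen"), Bytes.toBytes(1L))
          val isNew = dedup.checkAndPut(Bytes.toBytes(eventId),
            Bytes.toBytes("f"), Bytes.toBytes("seen"), null, put)
          if (isNew) {
            // Dimension-table join: enrich the event with user attributes.
            val profile = dim.get(new Get(Bytes.toBytes(userId)))
            // ... aggregate the enriched event into the indicator layer ...
          }
        }
      } finally {
        dedup.close(); dim.close(); conn.close()
      }
    }
  }
}
```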
21. HBase X-Pack: Offline data warehouse
[Diagram: online stores (PolarDB, RDS, ADB, HBase, Mongo, Redis) and message streams feed Spark and Spark Streaming jobs, which build the layered warehouse as Parquet files with Hive Meta.]
• Operational data layer: the rawest data, sitting in message middleware such as Kafka or LogHub, or in online databases such as PolarDB, RDS, Mongo, HBase, etc.
• Detail wide-table layer: Spark batch ETL or Spark Streaming builds the detailed wide tables (see the sketch after this list)
• Public summary wide-table layer: data is classified and modeled in Spark according to business themes, such as daily/monthly reports and model training
• Public dimension layer: static dimension tables
• Data application layer: the highly summarized data produced by the offline warehouse is stored in online databases to serve queries
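As a sketch of how the detail wide-table layer might be built, the batch job below joins an operational fact table with a dimension table and writes a partitioned Parquet table registered in the Hive metastore; all database, table, and column names are invented.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object BuildDetailWideTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("detail-wide-table")
      .enableHiveSupport() // register results in the Hive metastore
      .getOrCreate()

    // Operational-layer inputs (database/table names are invented).
    val orders = spark.table("ods.orders")
    val users  = spark.table("ods.users")

    // Join the fact table with a dimension table into one wide table.
    val wide = orders
      .join(users, Seq("user_id"), "left")
      .withColumn("dt", to_date(col("order_ts")))

    // Partitioned Parquet table in the detail (DWD) layer.
    wide.write
      .mode("overwrite")
      .partitionBy("dt")
      .format("parquet")
      .saveAsTable("dwd.order_wide")

    spark.stop()
  }
}
```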
22. We are hiring!
• If you are interested in the online SQL analysis engine
• If you are interested in the Spark kernel and ecosystem
ApsaraDB HBase X-Pack:
https://help.aliyun.com/document_detail/93899.html