Apache Spark is a great solution for building Big Data applications. It provides fast SQL-like processing, a machine learning library, and a streaming module for near-real-time processing of data streams. Unfortunately, during application development and production deployment we often encounter difficulties in mixing various data sources or bulk loading computed data into SQL or NoSQL databases.
https://www.bigdataspain.org/2017/talk/apache-spark-vs-rest-of-the-world-problems-and-solutions
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha... (Big Data Spain)
In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!
https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ... (Databricks)
Current workshop diagnostics are based on manually generated decision trees. This approach is increasingly reaching its limits due to growing variant diversity and the increasing complexity of vehicle systems. This session will describe BMW's new Apache Spark-enabled approach: use the available data from cars and workshops to train models that are able to predict the right part to switch, or the action to take.
You'll get an overview and presentation of BMW's complete pipeline, including ETL, model training based on Spark 2.1, serializing results along with metadata, and serving the gained insights as a web app. You'll also hear how Spark helped BMW leverage the information from millions of observations and thousands of features, learn what pitfalls they experienced (e.g. setting up a working dev toolchain, working with 50K features, parallelizing well), and how you can avoid them.
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop (MapR Technologies)
http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such there's been plenty of hype about it in recent months. But how much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, will cut through the noise to uncover practical advantages of having the full set of Spark technologies at your disposal and reveal the benefits of running Spark on Hadoop.
This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.
To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python (Miklos Christine)
Apache Spark is the next big data processing tool for data scientists. As seen in a recent StackOverflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's architecture: what's out now and what's in Spark 2.0. Spark APIs: the most common APIs used in Spark. Common misconceptions and proper techniques for using Spark.
Demo:
Walk through ETL of the Reddit dataset. SparkSQL analytics and visualizations of the dataset using Matplotlib. Sentiment analysis on Reddit comments.
Best Practices for Building and Deploying Data Pipelines in Apache Spark (Databricks)
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask (Víctor Zabalza)
# Talk given at PyCon UK 2017
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will detail the inner workings of a Python package that we have built which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
Lens: Data exploration with Dask and Jupyter widgets (Víctor Zabalza)
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will present Lens (https://github.com/asidatascience/lens), a Python package which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
Cosmos DB Real-time Advanced Analytics Workshop (Databricks)
The workshop implements an innovative fraud detection solution as a PoC for a bank that provides payment processing services for commerce to merchant customers across the globe, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Since their customers are around the world, the right solution should minimize any latency experienced using their service by distributing as much of the solution as possible, as closely as possible, to the regions in which their customers use the service. The workshop designs a data pipeline solution that leverages Cosmos DB for both the scalable ingest of streaming data and the globally distributed serving of both pre-scored data and machine learning models. Cosmos DB's major advantage when operating at a global scale is its high concurrency with low latency and predictable results.
This combination is unique to Cosmos DB and ideal for the bank's needs. The solution leverages the Cosmos DB change data feed in concert with Azure Databricks Delta and Spark capabilities to enable a modern data warehouse solution that can be used to create risk-reduction solutions for scoring transactions for fraud in an offline, batch approach and in a near-real-time, request/response approach. https://github.com/Microsoft/MCW-Cosmos-DB-Real-Time-Advanced-Analytics Takeaway: how to leverage Azure Cosmos DB and Azure Databricks along with Spark ML to build innovative advanced analytics pipelines.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... (Databricks)
Of all the developers' delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of the three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practice; 2) their performance and optimization benefits; and 3) scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (This talk is a vocalization of the blog, along with the latest developments in Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Unlocking Your Hadoop Data with Apache Spark and CDH5 (SAP Concur)
The Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event, showcasing real-world implementations of working with Spark within the context of your Big Data infrastructure.
Sessions are demo-heavy and slide-light, focusing on getting your development environment up and running, configuration issues, SparkSQL vs. Hive, etc.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
Project Tungsten: Bringing Spark Closer to Bare Metal (Databricks)
As part of the Tungsten project, Spark has started an ongoing effort to dramatically improve performance to bring the execution closer to bare metal. In this talk, we’ll go over the progress that has been made so far and the areas we’re looking to invest in next. This talk will discuss the architectural changes that are being made as well as some discussion into how Spark users can expect their application to benefit from this effort. The focus of the talk will be on Spark SQL but the improvements are general and applicable to multiple Spark technologies.
Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
In this talk, we present Koalas, a new open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically the first step for data scientists to explore and manipulate a data set. The problem is that pandas does not scale well to big data. It was designed for small data sets that a single machine can handle.
When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.
Through live demonstrations and code samples, you will understand: how to effectively leverage both pandas and Spark inside the same code base; how to leverage powerful pandas concepts such as lightweight indexing with Spark; and technical considerations for unifying the different behaviors of Spark and pandas.
Assessing Graph Solutions for Apache Spark (Databricks)
Users have several options for running graph algorithms with Apache Spark. To support a graph data architecture on top of its linear-oriented DataFrames, the Spark platform offers GraphFrames. However, because GraphFrames are immutable and not a native graph, they might not offer the features or performance needed for certain use cases. Another option is to connect Spark to a real-time, scalable, and distributed native graph database such as TigerGraph.
In this session, we compare three options — GraphX, Cypher for Apache Spark, and TigerGraph — for different types of workload requirements and data sizes, to help users select the right solution for their needs. We also look at the data transfer and loading time for TigerGraph.
Introduction to the Hadoop Ecosystem (FrOSCon Edition) (Uwe Printz)
Talk held at FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Enabling Biobank-Scale Genomic Processing with Spark SQL (Databricks)
With the size of genomic data doubling every seven months, existing tools in the genomic space designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while being flexible to ad-hoc analysis, Databricks and Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
Patrick Wendell, a founding committer of Spark, gave this talk at Strata London 2015 about Apache Spark.
These slides provide an introduction to Spark and delve into future developments, including DataFrames, the Datasource API, the Catalyst logical optimizer, and Project Tungsten.
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop (MongoDB)
What's the Scoop on MongoDB & Hadoop
Jake Angerman, Sr. Solutions Architect, MongoDB
MongoDB Evenings Dallas
March 30, 2016 at the Addison Treehouse, Dallas, TX
Apache Parquet - Apache Big Data North America 2017 (techmaddy)
Apache Parquet makes the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem. Apache Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested namespaces. Apache Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Apache Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented. This talk highlights the internal implementation of Apache Parquet.
This is an Apache Pig & Pig Latin session.
We provide training on Big Data & Hadoop, Hadoop Admin, MongoDB, Data Analytics with R, Python, etc.
Our Big Data & Hadoop course consists of an introduction to Hadoop and Big Data, HDFS architecture, MapReduce, YARN, Pig Latin, Hive, HBase, Mahout, ZooKeeper, Oozie, Flume, Spark, and NoSQL, with quizzes and assignments.
To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-and-hadoop-online-instructor-led-training
JSON_TO_HIVE_SCHEMA_GENERATOR is a tool that effortlessly converts your JSON data to a Hive schema, which can then be used with Hive to carry out processing of the data. It is designed to automatically generate a Hive schema from JSON data. It takes into account various issues (multiple JSON objects per file, NULL values, the absence of certain fields, etc.) and can parse millions of records to obtain a schema definition for data with nested structures.
Follow : https://github.com/jainpayal12/Json_To_HiveSchema_Generator.git
SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group... (George McGeachie)
An opportunity I could not miss - a 2-hour presentation on modelling to an audience of database experts!
Starting with a brief look at using Visio and/or Excel for data modelling and governance, I talked about the extras we can gain by using PowerDesigner to design databases.
Of course, it's not 'just' databases we're concerned with; the relationships those databases have with our business and technical architecture are also important.
The next key topic is the role of data models and others (such as the Requirements model) in governance and design.
Next, it's mapping data sources and targets to demonstrate and create data lineage, showing how PowerDesigner supports multiple DBMS versions (and what you can do to change how it does that), creating a cube for Business Objects, and finally (almost) focusing on the support provided for SAP IQ.
Finally, I described some real-world uses of PowerDesigner.
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017 (Big Data Spain)
Insights can only be as good as the data. The data quality domain is enormously large, so you need to understand your company pain points to know what to focus on first.
https://www.bigdataspain.org/2017/talk/big-data-big-quality
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Scaling a backend for a big data and blockchain environment by Rafael Ríos at... (Big Data Spain)
2gether is a financial platform based on Blockchain, Big Data and Artificial Intelligence that allows interaction between users and third-party services in a single interface.
https://www.bigdataspain.org/2017/talk/scaling-a-backend-for-a-big-data-and-blockchain-environment
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017 (Big Data Spain)
All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.
https://www.bigdataspain.org/2017/talk/disaster-recovery-for-big-data
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ... (Big Data Spain)
This talk shows the power of this new set of tools for data science. It is really easy to start applying these techniques in your current workflow.
https://www.bigdataspain.org/2017/talk/data-science-for-lazy-people-automated-machine-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ... (Big Data Spain)
GPUs in the cloud as Infrastructure as a Service (IaaS) seem like a commodity. However, efficiently distributing deep learning tasks across several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D... (Big Data Spain)
Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is a difficult process, usually addressed by unbalanced reduction techniques.
https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
State of the art time-series analysis with deep learning by Javier Ordóñez at... (Big Data Spain)
Time series related problems have traditionally been solved using engineered features obtained by heuristic processes.
https://www.bigdataspain.org/2017/talk/state-of-the-art-time-series-analysis-with-deep-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Trading at market speed with the latest Kafka features by Iñigo González at B... (Big Data Spain)
Not long ago, only banks and hedge funds could afford automated and High Frequency Trading, that is, the ability to send buy orders for commodities at microsecond intervals.
https://www.bigdataspain.org/2017/talk/trading-at-market-speed-with-the-latest-kafka-features
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data... (Big Data Spain)
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
The Analytic Platform behind IBM's Watson Data Platform by Luciano Resende a... (Big Data Spain)
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da... (Big Data Spain)
Artificial Intelligence and Data-centric businesses.
https://www.bigdataspain.org/2017/talk/tbc
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Why big data didn't end causal inference by Totte Harinen at Big Data Spain 2017 (Big Data Spain)
Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.
https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at... (Big Data Spain)
The Internet Meme Index will be the new standard for analyzing and predicting the fads and sensations that circulate on the Internet.
https://www.bigdataspain.org/2017/talk/meme-index-analyzing-fads-and-sensations-on-the-internet
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat... (Big Data Spain)
Geotab is a leader in the expanding world of the Internet of Things (IoT) and the telematics industry, powered by Big Data.
https://www.bigdataspain.org/2017/talk/vehicle-big-data-that-drives-smart-city-advancement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P... (Big Data Spain)
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart... (Big Data Spain)
In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.
https://www.bigdataspain.org/2017/talk/attacking-machine-learning-used-in-antivirus-with-reinforcement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ... (Big Data Spain)
The primary function of the banking sector is promoting economic activity, which means "commerce": exchanging what someone produces or has for something that someone consumes or desires.
https://www.bigdataspain.org/2017/talk/more-people-less-banking-blockchain
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017 (Big Data Spain)
Bol.com has been an early Hadoop user: since 2008, when its cluster was first built for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can... (Big Data Spain)
In an era of growing data complexity and volume and the advent of Big Data, feature selection has a key role to play in helping reduce high-dimensionality in machine learning problems.
https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017
1.
2. Apache Spark vs rest of the world
- Problems and Solutions
Arkadiusz Jachnik
3. #BigDataSpain 2017
About Arkadiusz
• Senior Data Scientist at AGORA SA
- user profiling & content personalization
- recommendation system
• PhD Student at
Poznan University of Technology
- multi-class & multi-label classification
- multi-output prediction
- recommendation algorithms
4. #BigDataSpain 2017
Agora’s BigData Team
[Team photo] my boss Luiza :) it's me! We are all here at #BDS! I invite you to the talk of these guys :)
Arek, Wojtek, Paweł, Paweł, Dawid, Bartek, Jacek, Daniel
6. #BigDataSpain 2017
Spark in Agora's BigData Platform
[Architecture diagram] Platform components: data collecting and integration; user profiling system; recommendation system; data analytics; data enrichment and content structurisation. Runs on an own-build Hadoop cluster (v2.2) with Spark: Structured Streaming, Spark Streaming, Spark SQL, MLlib. Over 3 years of experience.
7. #BigDataSpain 2017
Problems discussed today
1. Processing parts of data and loading them from Spark into a relational database in parallel
2. Bulk loading into an HBase database
3. From a relational database to a Spark DataFrame (with user-defined functions)
4. From HBase to Spark via a Hive external table (with timestamps of HBase cells)
5. Spark Streaming with Kafka: how to implement your own offset manager
8. #BigDataSpain 2017
I will show some code…
• I will show real technical problems we have encountered during Spark deployments
• We have used Spark at Agora for over 3 years, so we have a lot of experience
• I will present practical solutions, showing some code in Scala
• Scala is natural for Spark
9. 1. Processing and writing parts of data in parallel
Problem description:
• We have a huge processed DataFrame of computed recommendations for users
• There are 4 defined types of recommendations
• For each type we want to take the top-K recommendations for each user
• Recommendations of each type should be loaded into a different PostgreSQL table
#BigDataSpain 2017
User | Recommendation type | Article | Score
Grzegorz | TYPE_3 | Article F | 1.0
Bożena | TYPE_4 | Article B | 0.2
Grażyna | TYPE_2 | Article B | 0.2
Grzegorz | TYPE_3 | Article D | 0.9
Krzysztof | TYPE_3 | Article D | 0.4
Grażyna | TYPE_2 | Article C | 0.9
Grażyna | TYPE_1 | Article D | 0.3
Bożena | TYPE_2 | Article E | 0.9
Grzegorz | TYPE_1 | Article E | 1.0
Grzegorz | TYPE_1 | Article A | 0.7
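The deck never shows the schema of processedData explicitly; judging from the table above and the code on the following slides (which use the columns name, type, and score), a minimal illustrative sketch of the input, assuming an active SparkSession named spark, could be:

case class Recommendation(name: String, `type`: String, article: String, score: Double)

// Hypothetical sample rows matching the table above; in production this
// DataFrame is the output of the recommendation pipeline, not hand-built.
val processedData = spark.createDataFrame(Seq(
  Recommendation("Grzegorz", "TYPE_3", "Article F", 1.0),
  Recommendation("Bożena", "TYPE_4", "Article B", 0.2),
  Recommendation("Grażyna", "TYPE_2", "Article B", 0.2)
))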
10. #BigDataSpain 2017
Code intro: input & output
9
Grzegorz, Article A, 1.0
Grzegorz, Article F, 0.9
Grzegorz, Article C, 0.9
Grzegorz, Article D, 0.8
Grzegorz, Article B, 0.75
Bożena, ... ...
TYPE1
5recos.peruser
save table_1
Krzysztof, Article F, 1.0
Krzysztof, Article D, 1.0
Krzysztof, Article C, 0.8
Krzysztof, Article B, 0.85
Grażyna, Article C, 1.0
Grażyna, ... ...
TYPE2
4recos.peruser
save table_2
Grzegorz, Article E, 1.0
Grzegorz, Article B, 0.75
Grzegorz, Article A, 0.8
Bożena, Article E, 0.9
Bożena, Article A, 0.75
Bożena, Article C 0.75
TYPE3
3recos.peruser
save table_3
Grażyna, Article A, 1.0
Grażyna, Article F, 0.9
Bożena, Article B, 0.9
Bożena, Article D, 0.9
Grzegorz, Article B, 1.0
Grzegorz, Article E, 0.95
TYPE4
2recos.peruser
save table_4
11. #BigDataSpain 2017
Standard approach
recoTypes.foreach(recoType => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
[Spark UI screenshots: no parallelism vs. parallelism, but most of the tasks skipped]
12. #BigDataSpain 2017
Maybe we can add .par?
recoTypes.par.foreach(recoType => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
[Spark UI screenshot: parallelism, but too many tasks :(]
13. #BigDataSpain 2017
Our trick
parallelizeProcessing(recoTypes, (recoType: RecoType) => {
  val topNrecommendations = processedData
    .where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum)
    .drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})

def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit): Unit = {
  f(recoTypes.head)
  if (recoTypes.tail.nonEmpty) recoTypes.tail.par.foreach(f)
}
Execute the Spark action for the first type first, then parallelize the rest.
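Our reading of why this works: the first, sequential call pays for the shared upstream computation once, and the parallel jobs for the remaining types then reuse its results (compare the "tasks skipped" annotation above). A minimal sketch of the same idea with explicit caching - the .cache() call is our addition, not from the talk:

// Cache the shared DataFrame: the first action computes and caches it,
// the parallel jobs for the remaining types then mostly read cached data.
processedData.cache()

def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit): Unit = {
  f(recoTypes.head)                 // first Spark action fills the cache
  if (recoTypes.tail.nonEmpty)
    recoTypes.tail.par.foreach(f)   // Spark actions are safe to run from multiple threads
}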
14. 2. Fast bulk-loading to HBase
Problems with the standard HBase client (inserts with the Put class):
• Difficult integration with Spark
• Complicated parallelization
• For non-pre-split tables, problems with *Region*Exceptions
• Slow for millions of rows
(Diagram: Spark DataFrame / RDD → .foreachPartition → parallel hTable.put(…) calls, one per partition.)
15. #BigDataSpain 2017
Idea
Our approach is based on:
https://github.com/zeyuanxy/spark-hbase-bulk-loading
Input RDD:
data: RDD[(          // pair RDD
  Array[Byte],       // HBase row key
  Map[               // data:
    String,          // column family
    Array[(
      String,        // column name
      (String,       // cell value
       Long)         // timestamp
    )]
  ]
)]
General idea:
We have to save our RDD data as HFiles (the files in which HBase stores its data) and load them into a given pre-existing table.
General steps:
1. Implement a Spark Partitioner that defines how the data in a key-value pair RDD should be partitioned by HBase row key (a sketch of this partitioner follows the implementation below)
2. Repartition and sort the RDD within partitions, according to the column families and the start row keys of every HBase region
3. Save the RDD to HDFS as HFiles with the rdd.saveAsNewAPIHadoopFile method
4. Load the files into the table with LoadIncrementalHFiles (HBase API)
16. #BigDataSpain 2017
Implementation
// Prepare hConnection, tableName, hTable ...
val regionLocator = hConnection.getRegionLocator(tableName)
val columnFamilies = hTable.getTableDescriptor.getFamiliesKeys.map(Bytes.toString(_))
val partitioner = new HFilePartitioner(regionLocator.getStartKeys, fraction)

// prepare partitioned RDD
val rdds = for {
  family <- columnFamilies
  rdd = data
    .collect { case (key, dataMap) if dataMap.contains(family) => (key, dataMap(family)) }
    .flatMap { case (key, familyDataMap) =>
      familyDataMap.map { case (column: String, valueTs: (String, Long)) =>
        (((key, Bytes.toBytes(column)), valueTs._2), Bytes.toBytes(valueTs._1))
      }
    }
} yield getPartitionedRdd(rdd, family, partitioner)
val rddToSave = rdds.reduce(_ ++ _)

// prepare map-reduce job for bulk-load
HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)

// prepare path for HFiles output
val fs = FileSystem.get(hbaseConfig)
val hFilePath = new Path(...)
try {
  rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
    classOf[ImmutableBytesWritable], classOf[KeyValue],
    classOf[HFileOutputFormat2], job.getConfiguration)
  // prepare HFiles for incremental load by setting
  // folder permissions to read/write/exec for all...
  setRecursivePermission(hFilePath)
  val loader = new LoadIncrementalHFiles(hbaseConfig)
  loader.doBulkLoad(hFilePath, hConnection.getAdmin, hTable, regionLocator)
} // finally close resources, ...
The steps in the code above: prepare the HBase connection, table and region locator; build a Spark partitioner for the HBase regions; repartition and sort the data within partitions using that partitioner; save the HFiles to HDFS via saveAsNewAPIHadoopFile; finally load the HFiles into the HBase table.
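The snippet above refers to HFilePartitioner and getPartitionedRdd without showing them. A minimal sketch of what they may look like - our reconstruction, modeled on the zeyuanxy/spark-hbase-bulk-loading project linked above, not code from the talk:

import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Routes each ((rowKey, column), timestamp) key to the HBase region that will
// host the row key; `fraction` sub-splits each region into several partitions.
class HFilePartitioner(startKeys: Array[Array[Byte]], fraction: Int) extends Partitioner {
  override def numPartitions: Int = startKeys.length * fraction
  override def getPartition(key: Any): Int = {
    val rowKey = key.asInstanceOf[((Array[Byte], Array[Byte]), Long)]._1._1
    val region = math.max(0, startKeys.lastIndexWhere(Bytes.compareTo(_, rowKey) <= 0))
    region * fraction + ((java.util.Arrays.hashCode(rowKey) & Int.MaxValue) % fraction)
  }
}

// HFiles must be written in sorted key order, hence repartitionAndSortWithinPartitions.
implicit val bytesOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
  def compare(a: Array[Byte], b: Array[Byte]): Int = Bytes.compareTo(a, b)
}

def getPartitionedRdd(rdd: RDD[(((Array[Byte], Array[Byte]), Long), Array[Byte])],
                      family: String,
                      partitioner: HFilePartitioner): RDD[(ImmutableBytesWritable, KeyValue)] =
  rdd
    .repartitionAndSortWithinPartitions(partitioner)
    .map { case (((row, column), ts), value) =>
      (new ImmutableBytesWritable(row),
       new KeyValue(row, Bytes.toBytes(family), column, ts, value))
    }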
17. #BigDataSpain 2017
Keep in mind
• Tune the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily (default 32)
• For large data, too small a value of this parameter may cause IllegalArgumentException: Size exceeds Integer.MAX_VALUE
• Create HBase tables with splits adapted to the expected row keys
- example: for row keys built from hex IDs, create the table with splits like:
create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
- for subsequent single puts this minimizes *Region*Exceptions
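The same pre-splitting can also be done from code. A hedged sketch using the HBase 1.x Admin API (table and column-family names taken from the shell example above; hConnection is assumed to be an open HBase Connection):

import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.util.Bytes

// One split point per hex character: regions (-inf,'0'), ['0','1'), ..., ['f',+inf)
val splits: Array[Array[Byte]] =
  "0123456789abcdef".map(c => Bytes.toBytes(c.toString)).toArray

val descriptor = new HTableDescriptor(TableName.valueOf("hbase_table_name"))
descriptor.addFamily(new HColumnDescriptor("col-fam"))
hConnection.getAdmin.createTable(descriptor, splits)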
18. #BigDataSpain 2017
3. Loading data from Postgres to Spark
This is possible for data from Hive (note: to call a UDF by name in a SQL string, it must be registered):
val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val data: DataFrame = sparkSession.sql(
  "SELECT id, toUpperCaseUdf(code) FROM types"
)
But this is not possible for data from JDBC (for example PostgreSQL):
val toUpperCase: String => String = _.toUpperCase
val toUpperCaseUdf = udf(toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT toUpperCaseUdf(code) " +
    "FROM codes) as codesData",
    connectionConf)
This subquery is executed by Postgres (not Spark) - here you can specify only a Postgres table name or subquery, so the Spark UDF is unknown to the database. And how do we parallelize the data loading?
19. #BigDataSpain 2017
Our solution
Try to load the 'raw' data without UDFs, and then use .withColumn with the UDF as an expression (.jdbc produces a DataFrame, so the UDF is applied by Spark; the UDF must be registered for expr to resolve it):
val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT code " +
    "FROM codes) as codesData",
    connectionConf)
  .withColumn("upperCode",
    expr("toUpperCaseUdf(code)"))
But this still loads everything through one partition! To parallelize the load, we split the table read across executors on a selected numeric column:
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(
    url = jdbcUrl,
    table = "(SELECT code, type_id " +
      "FROM codes) as codesData",
    columnName = "type_id",
    lowerBound = 1L,
    upperBound = 100L,
    numPartitions = 10,
    connectionProperties = connectionConf)
21. #BigDataSpain 2017
4. From HBase to Spark by Hive
There are commonly used method for loading
data from HBase to Spark by Hive external
table:
CREATE TABLE hive_view_on_hbase (
key int,
value string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key, cf1:val"
)
TBLPROPERTIES (
"hbase.table.name" = "xyz"
);
Example HBase table, column family cities (one counter column per city):

row key      | Poznan | Warsaw | Cracow | Gdansk
72A9DBA74524 | 40     | 5      | 1      | 3
58383B36275A |        | 120    | 60     | 5
009D22419988 | 75     | 1      |        |

Hive view on this table (via the Hive-HBase-Handler):

user_id      | cities_map                                       | last_city
72A9DBA74524 | map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3) | ?
58383B36275A | map(Warsaw->120, Cracow->60, Gdansk->5)          | ?
009D22419988 | map(Poznan->75, Warsaw->1)                       | ?

But how to get the last (most recent) values? Where are the timestamps?
22. #BigDataSpain 2017
Our case
• We use the HDP distribution of the Hadoop cluster, with HBase 1.1.x
• It is possible to add the latest timestamp of a row modification to the Hive view on an HBase table:
CREATE TABLE hive_view_on_hbase (
key int,
value string,
ts timestamp
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping' = ':key, cf1:val, :timestamp'
)
TBLPROPERTIES (
'hbase.table.name' = 'xyz'
);
• But how to extract the timestamp of each cell?
• Answer: rewrite the Hive-HBase-Handler that is responsible for creating the Hive views on HBase tables :) … but first …
• Do not download the Hive source code from the Hive GitHub repository - check your Hadoop distribution first! (for example, HDP has its own code branch)
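Once the view exposes the row timestamp, it can be queried from Spark like any other Hive table. A minimal sketch, assuming the hive_view_on_hbase table above exists in the metastore and the SparkSession has Hive support enabled:

import org.apache.spark.sql.functions.max

// The row-level timestamp is now just a column, e.g. for the most recent
// modification per key:
val view = sparkSession.sql("SELECT key, value, ts FROM hive_view_on_hbase")
view.groupBy("key").agg(max("ts").as("last_modified")).show()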
24. #BigDataSpain 2017
There is a lot of code…
…but we have some tips on how to change the Hive-HBase-Handler:
• The functions that parse the columns of hbase.columns.mapping are located in HBaseSerDe.java, which returns a ColumnMappings object
• The LazyHBaseRow class stores the data of an HBase row
• Timestamps of the processed HBase cells can be read from the rows loaded (by the scanner) in the LazyHBaseCellMap class
• The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
25. #BigDataSpain 2017
5. Spark + Kafka: own offset manager
Problem description:
• Spark output operations are at-least-once
• For exactly-once semantics, you must store offsets after an idempotent output, or in an atomic transaction alongside the output
• Options:
1. Checkpoints
+ easy to enable via Spark checkpointing
- the output operation must be idempotent
- you cannot recover from a checkpoint if the application code has changed
2. Own data store
+ independent of changes to your application code
+ you can use data stores that support transactions
+ exactly-once semantics
(Diagram: a single Spark batch - process and save data, then save offsets. Image source: Spark Streaming documentation, https://spark.apache.org/docs/latest/streaming-programming-guide.html)
26. #BigDataSpain 2017
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(…)
val stream: DStream[ConsumerRecord[String, String]] = ...

stream.foreachRDD(rdd => {
  val toSave: Seq[String] = rdd.collect().map(_.value())
  saveData(toSave)
  offsetsStore.saveOffsets(rdd, ...)
})
27. #BigDataSpain 2017
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(...)
val stream: DStream[ConsumerRecord[String, String]] =
  kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)

stream.foreachRDD(rdd => {
  val toSave: Seq[String] = rdd.collect().map(_.value())
  saveData(toSave)
  offsetsStore.saveOffsets(rdd, zkPath)
})

def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext, offsetsStore: MyOffsetsStore,
                kafkaParams: Map[String, Object]): DStream[ConsumerRecord[String, String]] = {
  offsetsStore.readOffsets(topic, zkPath) match {
    case Some(offsetsMap) =>
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
    case None =>
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams))
  }
}
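For completeness, an illustrative kafkaParams map - our assumptions, using standard Kafka consumer settings; auto-commit is disabled because offsets are stored in ZooKeeper by MyOffsetsStore:

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092,broker2:9092", // hypothetical brokers
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "reco-consumer",                      // hypothetical group id
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // offsets managed by MyOffsetsStore
)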
28. #BigDataSpain 2017
Code of offset store
class MyOffsetsStore(zkHosts: String) {

  val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)

  // Offsets are stored under zkPath as a string of "partition:untilOffset" pairs.
  def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
    val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetsRanges.groupBy(_.topic).foreach { case (topic, offsetsRangesPerTopic) =>
      val offsetsRangesStr = offsetsRangesPerTopic
        .map(offRange => s"${offRange.partition}:${offRange.untilOffset}").mkString(",")
      zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
    }
  }

  // Reads the stored string back and rebuilds the TopicPartition -> offset map.
  def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
    val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
    offsetsRangesStrOpt match {
      case Some(offsetsRangesStr) =>
        Some(offsetsRangesStr.split(",").map(_.split(":")).map {
          case Array(partitionStr, offsetStr) =>
            new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
        }.toMap)
      case None => None
    }
  }
}