Author: Stefan Papp, Data Architect at “The unbelievable Machine Company”. An overview of big data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are the big data processing paradigms, and how do Spark 1.x/2.x and Apache Flink address them?
• When should you use batch processing and when stream processing? (A short sketch follows this list.)
• What are a Lambda Architecture and a Kappa Architecture?
• What are the best practices for your project?
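To make the batch-versus-stream question concrete, here is a minimal PySpark sketch (not taken from the talk itself) that computes the same word count once as a terminating batch job and once as a continuously updated Structured Streaming job; the file path, host and port are placeholders.

```python
# Minimal batch vs. streaming comparison in PySpark (illustrative sketch;
# the file path, host and port are placeholders, not from the original talk).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: read a static file, count words once, the job terminates.
batch_lines = spark.read.text("data/corpus.txt")
batch_counts = (batch_lines
                .select(explode(split(batch_lines.value, " ")).alias("word"))
                .groupBy("word").count())
batch_counts.show()

# Streaming: the same logic over an unbounded socket source;
# results are updated continuously instead of computed once.
stream_lines = (spark.readStream.format("socket")
                .option("host", "localhost").option("port", 9999).load())
stream_counts = (stream_lines
                 .select(explode(split(stream_lines.value, " ")).alias("word"))
                 .groupBy("word").count())
query = (stream_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```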
Jump-start your application migration to AWS with CloudEndure - STG305 - New ... (Amazon Web Services)
CloudEndure Migration is a no-cost solution for moving applications from any physical, virtual, or cloud-based infrastructure to AWS. It simplifies, expedites, and reduces the cost of cloud migration by offering a highly automated lift-and-shift solution. In this workshop, you learn how to install a CloudEndure agent, monitor initial replication process steps in the console, configure target blueprints, and launch test instances in AWS.
Operations: Production Readiness Review – How to stop bad things from Happening (Amazon Web Services)
There is more to deploying code than pushing the deploy button. A good practice that many companies follow is a Production Readiness Review (PRR) which is essentially a pre-flight check list before a service launches. This helps ensure new services are properly architected, monitored, secured, and more. We’ll walk through an example PRR and discuss the value of ensuring each of these is properly taken care of before your service launches.
Data Migration Plan PowerPoint Presentation Slides (SlideTeam)
Data transfer is a complex process for every business. Keeping this in mind, we have created the Data Migration Plan PowerPoint Presentation Slides. The deck contains slides such as data migration approach, steps, a simplified illustration of data migration steps, lifecycle, process, data migration on the cloud, and many more. Our team of experts uses editable charts, icons and graphs to design these presentation slides. The content-ready visuals are fully editable: you can modify the color, text, and font size. The deck includes relevant templates to cater to your business needs and lets you outline all the important concepts without hassle. Furthermore, the data migration strategy slides are apt for presenting related concepts such as data conversion, data curation, data preservation, and system migration. Showcase varied ways of data transformation using this professionally designed deck.
Designing and Building Next Generation Data Pipelines at Scale with Structure... (Databricks)
Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walk away with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
Data Quality With or Without Apache Spark and Its Ecosystem (Databricks)
A few solutions exist in the open-source community, either as libraries or as complete stand-alone platforms, that can be used to assure a certain level of data quality, especially when imports happen continuously. Organisations may consider picking one of the available options: Apache Griffin, Deequ, DDQ and Great Expectations. In this presentation we’ll compare these open-source products across dimensions such as maturity, documentation, extensibility, and features like data profiling and anomaly detection.
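To give a flavour of what such libraries look like in practice, here is a minimal sketch using the classic (pre-1.0) Great Expectations DataFrame API, one of the options named above; the DataFrame, column names and thresholds are invented for illustration, and result formats vary slightly across versions.

```python
# Illustrative sketch with the classic Great Expectations dataset API;
# the DataFrame, column names and thresholds are hypothetical.
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount":   [10.0, 25.5, None, 7.2],
}))

# Declarative expectations: each call returns a result with a "success" field.
not_null = df.expect_column_values_to_not_be_null("order_id")
in_range = df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

print(not_null)
print(in_range)
```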
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen (confluent)
Flink and Kafka are popular components to build an open source stream processing infrastructure. We present how Flink integrates with Kafka to provide a platform with a unique feature set that matches the challenging requirements of advanced stream processing applications. In particular, we will dive into the following points:
Flink’s support for event-time processing, how it handles out-of-order streams, and how it can perform analytics on historical and real-time streams served from Kafka’s persistent log using the same code. We present Flink’s windowing mechanism, which supports time-, count- and session-based windows and allows intermixing event-time and processing-time semantics in one program.
How Flink’s checkpointing mechanism integrates with Kafka for fault-tolerance, for consistent stateful applications with exactly-once semantics.
We will discuss “savepoints”, which allow users to save the state of the streaming program at any point in time. Together with a durable event log like Kafka, savepoints allow users to pause/resume streaming programs, go back to prior states, or switch to different versions of the program, while preserving exactly-once semantics.
We explain the techniques behind the combination of low-latency and high-throughput streaming, and how the latency/throughput trade-off can be configured.
We will give an outlook on current developments for streaming analytics, such as streaming SQL and complex event processing.
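The following is a toy, framework-free Python sketch of the event-time ideas discussed above (tumbling windows plus a watermark for out-of-order events). It only illustrates the concepts; it is not the Flink API, and the window size, watermark lag and event stream are invented.

```python
# Toy illustration of event-time tumbling windows with a watermark;
# conceptual only -- this is not the Flink API.
from collections import defaultdict

WINDOW_SIZE = 60           # seconds per tumbling window
MAX_OUT_OF_ORDERNESS = 10  # watermark lag in seconds

windows = defaultdict(int)   # window start -> event count
emitted = set()
watermark = 0                # "no event earlier than this is expected anymore"

def on_event(event_time):
    """Assign an event to its event-time window and advance the watermark."""
    global watermark
    window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
    windows[window_start] += 1
    watermark = max(watermark, event_time - MAX_OUT_OF_ORDERNESS)
    # A window is emitted once the watermark has passed its end, i.e. after
    # waiting for late events within the allowed out-of-orderness.
    for start in sorted(windows):
        if start + WINDOW_SIZE <= watermark and start not in emitted:
            print(f"window [{start}, {start + WINDOW_SIZE}): {windows[start]} events")
            emitted.add(start)

# An out-of-order stream of event timestamps (seconds since start).
for t in [5, 42, 61, 58, 75, 130, 124, 190]:
    on_event(t)
```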
In this hands-on workshop, we'll explore how to deploy resources to Azure using Terraform. First we'll peek into the basics of Terraform (the HCL language, CLI, providers, provisioners, modules, plans, state files, etc.).
Then, in our hands-on exercise, we'll author Terraform scripts to deploy virtual networks, virtual machines and app services to Azure. Finally we'll walk through some Azure tooling and integrations for Terraform (Azure Cloud Shell, hosted images in Azure DevOps, Azure Marketplace images, VS Code extensions, etc.).
Author: Mithun Shanbhag
Data Streaming with Apache Kafka & MongoDB (confluent)
Explore the use-cases and architecture for Apache Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
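As one simple way to wire the two systems together, here is a hedged Python sketch that consumes JSON events from Kafka and writes them to MongoDB using the kafka-python and pymongo client libraries; the topic name, brokers, database and collection names are placeholders, not from the original deck.

```python
# Minimal Kafka -> MongoDB consumer sketch (illustrative; topic, brokers
# and collection names are placeholders).
import json
from kafka import KafkaConsumer          # pip install kafka-python
from pymongo import MongoClient          # pip install pymongo

consumer = KafkaConsumer(
    "events",                                    # hypothetical topic
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

collection = MongoClient("mongodb://localhost:27017")["demo"]["events"]

for message in consumer:
    # Each Kafka record becomes one MongoDB document.
    collection.insert_one(message.value)
```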
Best Practices for Enabling Speculative Execution on Large Scale Platforms (Databricks)
Apache Spark has a ‘speculative execution’ feature to handle tasks in a stage that run slowly due to environment issues such as a slow network or disk. If one task in a stage is running slowly, the Spark driver can launch a speculation task for it on a different host. Between the regular task and its speculation task, the Spark system will take the result from whichever completes successfully first and kill the slower one.
When we first enabled the speculation feature for all Spark applications by default on a large cluster of 10K+ nodes at LinkedIn, we observed that the default values of Spark’s speculation configuration parameters did not work well for LinkedIn’s batch jobs. For example, the system launched too many fruitless speculation tasks (i.e. tasks that were killed later). Besides, the speculation tasks did not help shorten the shuffle stages. In order to reduce the number of fruitless speculation tasks, we tracked down the root cause, enhanced the Spark engine, and tuned the speculation parameters carefully. We analyzed the number of speculation tasks launched, the number of fruitful versus fruitless speculation tasks, and their corresponding CPU-memory resource consumption in gigabyte-hours. We were able to reduce average job response times by 13%, decrease the standard deviation of job elapsed times by 40%, and lower total resource consumption by 24% in a heavily utilized multi-tenant environment on a large cluster. In this talk, we will share our experience in enabling speculative execution to achieve a good reduction in job elapsed time while keeping the overhead minimal.
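For reference, speculative execution is controlled by a handful of Spark configuration properties. A minimal PySpark sketch of turning it on and tightening the launch criteria looks like this; the specific values are illustrative starting points, not LinkedIn's tuned production settings.

```python
# Enabling and tuning speculative execution in PySpark; the values below are
# illustrative defaults to experiment with, not LinkedIn's production settings.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-demo")
         # Launch speculative copies of slow tasks.
         .config("spark.speculation", "true")
         # How often to check for tasks to speculate.
         .config("spark.speculation.interval", "1s")
         # A task must be this many times slower than the median to qualify.
         .config("spark.speculation.multiplier", "3")
         # Fraction of tasks that must finish before speculation starts.
         .config("spark.speculation.quantile", "0.9")
         .getOrCreate())
```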
A machine learning and data science pipeline for real companies (DataWorks Summit)
Comcast is one of the largest cable and telecommunications providers in the country built on decades of mergers, acquisitions, and subscriber growth. The success of our company depends on keeping our customers happy and how quickly we can pivot with changing trends and new technologies. Data abounds within our internal data centers and edge networks as well as both the private and public cloud across multiple vendors.
Within such an environment and given such challenges, how do we get AI, machine learning, and data science platforms built so our company can respond to the market, predict our customers’ needs and create new revenue generating products that delight our customers? If you don’t happen to be our friends and colleagues at Google, Facebook, and Amazon, what are technologies, strategies, and toolkits you can employ to bring together disparate data sets and quickly get them into the hands of your data scientists and then into your own production systems for use by your customers and business partners?
We’ll explore our journey and evolution and look at specific technologies and decisions that have gotten us to where we are today and demo how our platform works.
Speaker
Ray Harrison, Comcast, Enterprise Architect
Prashant Khanolkar, Comcast, Principal Architect Big Data
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft... (Amazon Web Services)
Amazon EMR is a managed service that makes it easy for customers to use big data frameworks and applications like Apache Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3, Amazon’s highly scalable object storage service. In this session, we will introduce Amazon EMR and the greater Apache Hadoop ecosystem, and show how customers use them to implement and scale common big data use cases such as batch analytics, real-time data processing, interactive data science, and more. Then, we will walk through a demo to show how you can start processing your data at scale within minutes.
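As a rough sketch of launching such a cluster programmatically, the boto3 EMR client can create a cluster with Hadoop and Spark installed. The release label, instance types, IAM roles and S3 log path below are placeholders, not values from the session.

```python
# Rough sketch: creating an EMR cluster with Hadoop and Spark via boto3.
# Release label, instance types, roles and the S3 log URI are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-spark-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://my-bucket/emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("cluster id:", response["JobFlowId"])
```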
An Introduction to the AWS Well Architected Framework - Webinar (Amazon Web Services)
The AWS Well-Architected Framework enables customers to understand best practices around security, reliability, performance, cost optimization and operational excellence when building systems on AWS. This approach helps customers make informed decisions and weigh the pros and cons of application design patterns for the cloud.
In this one hour webinar, you'll learn how to use the AWS Well-Architected Framework to follow guidelines and best practices for your architecture on AWS.
In this webinar, we will walk through canvas and model-driven apps and the differences between them. We will learn how to create canvas apps with a Dynamics 365 data source and how to create model-driven apps. We will also take a look at creating entities in CDS and at running and sharing canvas and model-driven apps.
Azure Site Recovery - BC/DR - Migrations & assessments in 60 minutes! (Johan Biere)
Gain a high-level understanding of the various challenges organizations face when planning their migration and BC/DR strategy for applications in Azure and hybrid environments.
Learn about the capabilities that make Azure the ideal destination for your applications, data, and infrastructure. You will get clear, scenario-based guidance on how to approach your technical migration and innovation journeys. Understand how to migrate different on-premises applications to Azure, including moving them to Azure IaaS using ASR, with Hyper-V assessments and agentless migration.
Building Data Quality pipelines with Apache Spark and Delta Lake (Databricks)
Technical Leads and Databricks Champions Darren Fuller & Sandy May will give a fast-paced view of how they have productionised Data Quality pipelines across multiple enterprise customers. Their vision of empowering business decisions on data remediation actions and self-healing of data pipelines led them to build a library of Data Quality rule templates with an accompanying reporting data model and Power BI reports.
With the drive for more and more intelligence coming from the lake and less from the warehouse (also known as the Lakehouse pattern), data quality at the lake layer becomes pivotal. Tools like Delta Lake become building blocks for data quality with schema protection and simple column checking; for larger customers, however, they often do not go far enough. Quick-fire notebook demos will show how Spark can be leveraged at the point of staging or curation to apply rules over data.
Expect to see simple rules such as “Net sales = Gross sales + Tax”, or values existing within a list, as well as complex rules such as validation of statistical distributions and complex pattern matching. We end with a quick view of future work in the realm of data compliance for PII data, with generation of rules using regex patterns and machine-learning rules based on transfer learning.
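To illustrate the kind of simple rule mentioned above, here is a minimal plain-PySpark sketch that flags rows violating such a check; this is not the authors' rule-template library, and the column names, sample data and tolerance are invented.

```python
# Minimal PySpark sketch of a simple data-quality rule like the one above
# ("Net sales = Gross sales + Tax"); column names, data and tolerance are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-rule-demo").getOrCreate()

sales = spark.createDataFrame(
    [(100.0, 80.0, 20.0), (50.0, 45.0, 4.0)],
    ["net_sales", "gross_sales", "tax"],
)

# Flag rows that violate the rule, allowing a small rounding tolerance.
checked = sales.withColumn(
    "rule_violated",
    F.abs(F.col("net_sales") - (F.col("gross_sales") + F.col("tax"))) > 0.01,
)

checked.show()
print("violations:", checked.filter("rule_violated").count())
```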
How to boost your data management with Dremio? (Vincent Terrasi)
Works with any source: relational, non-relational, third-party apps. Five years ago nobody was using Hadoop or MongoDB, and five years from now there will be new products. You need a solution that is future-proof.
Works with any BI tool. In every company multiple tools are in use. Each department has their favorite. We need to work with all of them.
No ETL, data warehouse, cubes. This would need to give you a really good alternative to these options.
Makes data self-service, collaborative. Probably most important of all, we need to change the dynamic between the business and IT. We need to make it so business users can get the data they want, in the shape they want it, without waiting on IT.
Makes Big Data feel small. It needs to make billions of rows feel like a spreadsheet on your desktop.
Open source. It’s 2017, so we think this has to be open source.
Introduction
Benefits
Concepts
Templates
CLI Tool
CloudFormation Demo
CloudFormer (Intro)
Questions
The tutorial covers an introduction to CloudFormation, its benefits, its concepts, the CLI tool, a CloudFormation demo, and an introduction to CloudFormer. It begins with an introduction to CloudFormation, followed by a section on the benefits of CloudFormation, including the services CloudFormation works with.
The next section covers the core CloudFormation concepts: templates and stacks. The template part covers the description, objects, a sample template, parameters, resources, the types of resources, and the steps to create a template, while the stack part covers stacks as collections of resources that are created or deleted together. After that comes the CLI tool section, which introduces the CLI tool called CFN.
The CLI tool section is followed by a CloudFormation demo, which shows which templates are useful and also covers issues you may run into. The last section introduces CloudFormer: which tool and architecture it uses, and what is possible while using it.
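As a small companion to the demo described above, a stack can also be created programmatically. Below is a minimal sketch using the boto3 CloudFormation client with an inline template; this is not the tutorial's CFN CLI tool, and the stack name and the single S3 bucket resource are placeholders.

```python
# Minimal sketch: creating a CloudFormation stack from an inline template
# with boto3; the stack name and S3 bucket resource are placeholders.
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Demo template with a single S3 bucket",
    "Resources": {
        "DemoBucket": {"Type": "AWS::S3::Bucket"}
    },
}

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(
    StackName="cloudformation-demo",
    TemplateBody=json.dumps(template),
)

# Wait until the stack is fully created before using its resources.
cfn.get_waiter("stack_create_complete").wait(StackName="cloudformation-demo")
```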
Microservices are small services with independent lifecycles that work together. There is an underlying tension in that definition – how independent can you be when you have to be part of a whole? I’ve spent much of the last couple of years trying to understand how to find the right balance, and in this talk/tutorial I’ll be presenting the core seven principles that I think represent what makes microservices tick.
After a brief introduction of what microservices are and why they are important, we’ll spend the bulk of the time looking at the principles themselves, wherever possible covering real-world examples and technology:
- Modelled around business domain – using techniques from domain-driven design to find service boundaries leads to better team alignment and more stable service boundaries, avoiding expensive cross-service changes.
- Culture of automation – all organisations that use microservices at scale have strong cultures of automation. We’ll look at some of their stories and think about which sort of automation is key.
- Hide implementation details – how do you hide the detail inside each service to avoid coupling, and ensure each service retains its autonomous nature?
- Decentralize all the things! – we have to push power down as far as we can, and this goes for both the system and organisational architecture. We’ll look at everything from autonomous self-contained teams and internal open source, to using choreographed systems to handle long-lived business transactions.
- Deploy independently – this is all about being able to deploy safely. So we’ll cover everything from deployment models to consumer-driven contracts and the importance of separating deployment from release.
- Isolate failure – just making a system distributed doesn’t make it more stable than a monolithic application. So what do you need to look for?
- Highly observable – we need to understand the health of a single service, but also the whole ecosystem. How?
In terms of learning outcomes, beginners will get a sense of what microservices are and what makes them different, whereas more experienced practitioners will get insight and practical advice into how to implement them.
Have you prepared your AWS environment for detecting and managing security-related events? Do you have all the incident response training and tools you need to rapidly respond to, recover from, and determine the root cause of security events in the cloud? Even if you have a team of incident response rock stars with an arsenal of automated data acquisition and computer forensics capabilities, there is likely a thing or two you will learn from several step-by-step demonstrations of wrangling various potential security events within an AWS environment, from detection to response to recovery to investigating root cause. At a minimum, show up to find out who to call and what to expect when you need assistance with applying your existing, already awesome incident response runbook to your AWS environment.
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202) (Amazon Web Services)
Amazon Redshift is a fast, simple, cost-effective data warehousing solution, and in this session, we look at the tools and techniques you can use to migrate your existing data warehouse to Amazon Redshift. We will then present a case study on Scholastic’s migration to Amazon Redshift. Scholastic, a large 100-year-old publishing company, was running their business with older, on-premise, data warehousing and analytics solutions, which could not keep up with business needs and were expensive. Scholastic also needed to include new capabilities like streaming data and real time analytics. Scholastic migrated to Amazon Redshift, and achieved agility and faster time to insight while dramatically reducing costs. In this session, Scholastic will discuss how they achieved this, including options considered, technical architecture implemented, results, and lessons learned.
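One workhorse technique in such a migration is bulk-loading data from S3 with Redshift's COPY command. A hedged Python sketch using psycopg2 is shown below; the connection details, table, S3 path and IAM role are all placeholders, not details from the Scholastic case study.

```python
# Hedged sketch: bulk-loading a table into Amazon Redshift from S3 using the
# COPY command via psycopg2; all connection details and paths are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="replace-me",
)

copy_sql = """
    COPY sales
    FROM 's3://my-bucket/exports/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift loads the files in parallel across slices
```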
State, Local and Education customers are using the AWS cloud to enable faster disaster recovery of their mission critical IT systems without incurring the infrastructure expense of a second physical site. Join us for an informative webinar on how AWS cloud supports many popular disaster recovery (DR) architectures from “pilot light” environments that are ready to scale up at a moment’s notice to “hot standby” environments that enable rapid failover. With infrastructure centers in 10 regions around the world, AWS provides a set of cloud-based DR services that enable rapid recovery of your IT infrastructure and data.
Author: Allan Hanbury, senior researcher at the Vienna University of Technology, leader of the Data Science Research Studio Austria, co-founder of ContextFlow.
Part 1: Data Market Austria, a recently-started project building a Data-Services Ecosystem in Austria.
• What is the vision for Data Market Austria?
• How is DMA planned to develop?
• What are some of the technologies behind DMA?
• How can one participate in DMA?
• What are the requests from practitioners for DMA?
Part 2: an outline of a Data Science Continuing Education course being developed at the Vienna University of Technology.
Big Data Processing in the Cloud: A Hydra/Sufia Experience (rotated8)
This presentation addresses the challenge of processing big data in a cloud-based data repository. Using the Hydra Project’s Hydra and Sufia ruby gems and working with the Hydra community, we created a special repository for the project, and set up background jobs. Our approach is to create the metadata with these jobs, which are distributed across multiple computing cores. This will allow us to scale our infrastructure out on an as-needed basis, and decouples automatic metadata creation from the response times seen by the user. While the metadata is not immediately available after ingestion, it does mean that the object is. By distributing the jobs, we can compute complex properties without impacting the repository server. Hydra and Sufia allowed us to get a head start by giving us a simple self deposit repository, complete with background jobs support via Redis and Resque.
Big Data Processing in the Cloud: a Hydra/Sufia Experience
Zhiwu Xie, Ph.D., Associate Professor and Technology Development Librarian, Center for Digital Research and Scholarship University Libraries, Virginia Tech
Cloud PARTE: Elastic Complex Event Processing based on Mobile Actors (Stefan Marr)
Traffic monitoring or crowd management systems produce large amounts of data in the form of events that need to be processed to detect relevant incidents.
Rule-based pattern recognition is a promising approach for these applications; however, increasing amounts of data as well as large and complex rule sets demand more and more processing power and memory. In order to scale such applications, a rule-based pattern detection system needs to be distributable over multiple machines. Today’s approaches, however, focus on static distribution of rules or do not support reasoning over the full set of events.
We propose Cloud PARTE, a complex event detection system that implements the Rete algorithm on top of mobile actors. These actors can migrate between machines to respond to changes in the workload distribution. Cloud PARTE is an extension of PARTE and offers the first rule engine specifically tailored for continuous complex event detection that is able to benefit from elastic systems as provided by cloud computing platforms. It supports fully automatic load balancing and online rules with access to the entire event pool.
On 21 September we had the pleasure of hosting at our offices a meetup given by our colleague Paco Guerrero on the Apache Flink platform.
"Apache Flink is an open-source real-time processing platform that is on the rise because it offers features that competing technologies lack, without sacrificing performance. In this session we introduce the philosophy and the processing engine that make Flink so special and powerful, and we also walk through the basic pillars that establish Flink as the most promising streaming platform today."
Big data introduction - Big Data from a Consulting perspective - Sogeti (Edzo Botjes)
Big data introduction - Sogeti - Consulting Services - Business Technology - 20130628 v5
This is a short introduction to the topic of Big Data and a brief vision of how to enable a (big) company to use big data and embed it in the organisation.
Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted at an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal, which was submitted in March.
Benefits and challenges that Big Data & Analytics brings to companies on the jour... (Flávio Secchieri Mariotti)
APICON 2017 in Brazil. Thank you for the invitation and the opportunity to share some of CSC's experience on the digital transformation journey, with an emphasis on Big Data & Analytics.
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L... (Karthik Ramasamy)
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. To meet this challenge, Twitter designed an end-to-end real-time stack consisting of DistributedLog, the distributed and replicated messaging system, and Heron, the streaming system for real-time computation. DistributedLog is a replicated log service built on top of Apache BookKeeper, providing infinite, ordered, append-only streams that can be used for building robust real-time systems. It is the foundation of Twitter’s publish-subscribe system. Twitter Heron is the next-generation streaming system built from the ground up to address our scalability and reliability needs. Both systems have been in production for nearly two years and are widely used at Twitter in a range of diverse applications such as the search ingestion pipeline, ad analytics, image classification and more. These slides describe Heron and DistributedLog in detail, covering a few use cases in depth and sharing the operating experiences and challenges of running large-scale real-time systems at scale.
Full Video: https://www.youtube.com/watch?v=cOShsisEsC0
An overview of the relation and combination of three data processing paradigms that is becoming more relevant today. It introduces the essentials of graph, distributed and stream computing and beyond. Furthermore, it questions the fundamental problems that we want to solve with data analysis and the potential of eventually saving humankind in the next millennium by improving the state of the art of computation technologies while being too busy answering first-world-problem questions. Crazy but possible.
Judd Bagley gives insight into the future of the big data revolution and where he sees the industry going in 2017. Visit Judd's website at http://www.juddbagley.com
This presentation introduces the concepts of Big Data in layman's language. The author does not claim originality of the content; the presentation is compiled from various sources, and no copyright is claimed over the compiled material.
Big data is rising exponentially in today's age of information and digital shrinkage. This presentation clarifies the concept and the hype revolving around it.
Data Mining, Predictive Analytics and Big Data - Course information Spring 2017 (Andrés Fortino, PhD)
Invitation to an NYU online seminar for Spring 2017 - Gain an overview of the collection, analysis, and visualization of complex data, as well as the relevant pivotal concepts.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms (DataStax Academy)
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led and partly self-paced labs, we will cover Spark concepts, and you'll do labs for Spark SQL and DataFrames in Databricks Community Edition (a minimal sketch follows the outline below).
Toward the end, you'll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect and how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
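A minimal sketch of the SparkSession-centred APIs covered in the outline above; this is not the workshop's actual notebook, and the sample data is invented for illustration.

```python
# Minimal sketch of the Spark 2.x entry point and DataFrame/SQL APIs covered
# in the outline above; the sample data is invented for illustration.
from pyspark.sql import SparkSession

# In Spark 2.x, SparkSession subsumes the older SparkContext/SQLContext entry points.
spark = SparkSession.builder.appName("jump-start").getOrCreate()
sc = spark.sparkContext   # the underlying SparkContext is still accessible

people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# The same query expressed through the DataFrame API and through Spark SQL.
people.filter(people.age > 30).select("name").show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```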
Jump Start into Apache® Spark™ and Databricks (Databricks)
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Writing Continuous Applications with Structured Streaming Python APIs in Apac... (Databricks)
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark 2.x enables writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what a continuous application is, an appreciation of the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
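A compact sketch of the kind of end-to-end Structured Streaming program described above, in the Python API; this is not the talk's demo code, and the input directory, schema and window/watermark sizes are invented.

```python
# Compact Structured Streaming sketch in the Python API; the input directory,
# schema, window size and watermark are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("continuous-app").getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("action", StringType()),
])

# Treat newly arriving JSON files as an unbounded table.
events = spark.readStream.schema(schema).json("/data/incoming/")

# Event-time windowed aggregation with a watermark for late data.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"), col("action"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```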
Unified Big Data Processing with Apache Spark (QCON 2014) (Databricks)
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
Jump Start with Apache Spark 2.0 on Databricks (Anyscale)
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Composable Parallel Processing in Apache Spark and Weld (Databricks)
The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc).
Speaker: Matei Zaharia
Founding Spark committer Patrick Wendell gave this talk about Apache Spark at Strata London 2015.
These slides provide an introduction to Spark and delve into future developments, including DataFrames, the Datasource API, the Catalyst logical optimizer, and Project Tungsten.
Volodymyr Lyubinets, "Introduction to big data processing with Apache Spark" (IT Event)
In this talk we’ll explore Apache Spark — the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general.
This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. We will cover approaches to processing Big Data on a Spark cluster for real-time analytics, machine learning, and iterative BI, and also discuss the pros and cons of using Spark in the Azure cloud.
This talk provides a critical view on employing machine learning / deep learning methods in algorithmic trading. We highlight the particular challenges that we meet in this domain along with approaches to tackle some of these challenges in practice. Even though experience has shown that algorithmic trading using advanced machine learning can be successful, the crucial issue remains that predictive patterns utilizing market inefficiencies quickly become void as soon as competing market participants use them too. The conclusion is that the crucial advantage is – and has always been – to know more and to be faster than competitors.
Our Speaker: Dr. Ulrich Bodenhofer
MSc (applied math, Johannes Kepler University, Linz, Austria, 1996)
PhD (applied math, Johannes Kepler University, Linz, Austria, 1998)
Since June 2018: Chief Artificial Intelligence Officer at QUOMATIC.AI (Linz, Austria)
The consumer product landscape, particularly among e-commerce firms, includes a bevy of subscription-based business models. Internet and mobile phone subscriptions are now commonplace and joining the ranks are dietary supplements, meals, clothing, cosmetics and personal grooming products.
Standard metrics to diagnose a healthy consumer-brand relationship typically include customer purchase frequency and, ultimately, retention of the customer demonstrated by regular purchases. If a brand notices that a customer isn’t purchasing, it may consider targeting the customer with discount offers or deploying a tailored messaging campaign in the hope that the customer will return and not “churn”. The churn diagnosis, however, becomes more complicated for subscription-based products, many of which offer multiple delivery frequencies and the ability to pause a subscription. Brands with subscription-based products need a reliable measure of churn propensity so they can isolate the factors that lead to churn and preemptively identify at-risk customers.
Since the worldwide outbreak of the COVID-19 pandemic, experts all around the globe have been working hard to establish reliable forecasts for the spread of the disease. These forecasts allow decision-makers to roughly plan ahead and inform the population with estimates of what might still lie ahead. Yet the huge jungle of different models, data and results is confusing and difficult to survey: which models are reliable, which results can be trusted, and what are the secrets behind these models?
OUR SPEAKER
Dr. Martin Bicher is the chief developer of the COVID-19 modeling team around Dr. Niki Popper at dwh simulation-services GmbH, which currently supports decision-makers all around Austria with simulated forecasts, scenarios and policy evaluations. Moreover, he is a postdoctoral researcher at the Institute of Information Systems Engineering at TU Wien, where he finished his PhD in Technical Mathematics.
State-of-the-art time-series prediction with continuous-time recurrent neural networks.
Neural networks with continuous-time hidden state representations have become unprecedentedly popular within the machine learning community. This is due to their strong approximation capability in modeling time series, their adaptive computation modality, and their memory and parameter efficiency. In this talk Ramin will discuss how this family of neural networks works and why they achieve attractive degrees of generalizability across different application domains.
OUR SPEAKER
Ramin Hasani, PhD, Machine Learning Scientist at TU Wien and expert in robotics, previously a research scholar at MIT CSAIL, presents technical aspects of continuous-time neural networks.
As more and more machines are supplied with machine learning algorithms, the question arises who is liable in cases of damage? Who is liable in case of accidents involving an autonomous driving car? Is there a difference when an autonomous lawnmower causes damage to the neighbour's property? Public interest in those questions is high, whereas legal opinions are rare and court decisions are missing. Daniel will show why it can be difficult to fit machine learning-based applications in the existing legal liability system, and what the future might look like.
This presentation will describe a new literacy: “data literacy”, the analogy with computer literacy, and reasons why this skill set will soon be as essential to all professionals as computer literacy is today. It will address the advent of decision making as the key managerial activity and the resulting democratization of AI and analytics. The presentation will address issues of mindset, as well as skill set, and the ways to acquire a basic level of data literacy to derive value from AI- and ML-assisted processes in one’s daily tasks.
Kaggle is one of the largest online communities for data scientists specifically known for their competitions where participants aim to solve data science challenges. Kaggle has a long history of varying types of competitions from different areas such as medicine, finance, scientific research, or sports focusing on different types of data and prediction problems such as tabular data, time series, NLP, or computer vision.
NLP in a Bank: Automated Document Reading: Yevgen Kolesnyk / Patrik Zatko / D... (Vienna Data Science Group)
Despite the fast pace of digitalization in the modern world, core processes in banking are still largely based on printed documents. Document processing therefore consumes a significant amount of manpower and processing time, and increases the bank's operational risk because it is prone to human error. In this session, you will learn how automated document processing creates a great opportunity to modernize and simplify the way modern banks work, reduce the associated operational risk, and cut the time and cost spent within a given process area.
The analysis of movement is an important research topic in, for example, geography, ecology, visual analytics and GIScience, as well as in application domains such as urban, maritime, and aviation research. Movement data analysis requires tools for the manipulation and visualization of movement or trajectory data. This talk presents the new Python library MovingPandas (movingpandas.org).
Armin Rabitsch's presentation on the importance of social media in the electi... (Vienna Data Science Group)
Armin Rabitsch from wahlbeobachtung.org discusses the importance of social media in the election campaign process, the need to monitor the online public discourse of political candidates and the advertising budget flows as part of civil society's efforts to uphold democratic institutions, and the obstacles that currently stand in the way.
Data Science Salon Vol. 3 on 21 Oct 2019: Social Media – Monitoring Their Impact on Civil Society
Martina Chichi describes Amnesty International Italy's Barometer of Hate Project (Vienna Data Science Group)
Martina Chichi describes Amnesty International Italy's Barometer of Hate Project, which approaches online hate speech from a human rights perspective. Their goal is to pin down the main targets and triggers for online abuse in Italy, and determine the extent of politician accountability in the level of discourse.
Data Science Salon Vol. 3 on 21 Oct 2019: Social Media – Monitoring Their Impact on Civil Society
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
2. • Data Architect at unbelievable machine Company
• Software Engineering Background
• Jack of all Trades who also dives into Business topics, Systems Engineering and Data Science.
• Big Data since 2011
• Cross-Industry: From Automotive to Transportation
• Other Activities
• Trainer: Hortonworks Apache Hadoop Certified Trainer
• Author: Articles and book projects
• Lecturer: Big Data at FH Technikum and FH Wiener Neustadt
Stefan Papp
3. Agenda
• Big Data Processing
• Evolution in Processing Big Data
• Data Processing Patterns
• Components of a Data Processing Engine
• Apache Spark
• Concept
• Ecosystem
• Apache Flink
• Concept
• Ecosystem
4. Big Data Processing Engines on a Hadoop 2.x Reference(!) Stack
HADOOP 2.x STACK
• File System: HDFS (redundant, reliable storage)
• Data Operating System: YARN (cluster resource management)
• Engines and APIs on top of YARN (category: technology)
• Batch: MapReduce
• Direct: Java
• Search: Solr
• Batch & Interactive: Tez
• Script: Pig
• SQL: Hive
• Cascading: Java
• Real-Time: Slider
• NoSQL: HBase
• Stream: Storm
• RDD & PACT: Spark, Flink
• Machine Learning: SparkML
• Graph: Giraph
• Other Applications
7. IO Read Challenge: Read 500 GB Data (as a Reference)
• Assumption
• Shared nothing, plain read
• You can read 256 MB in 1.9 seconds
• Single Node
• Total Blocks in 500 GB = 1954 Blocks
• 1954 * 1.9 s / 3600 = approx. 1 hour sequential read.
• A 40 node cluster with 8 HDs on each node
• 320 HDs -> 6 to 7 blocks on each disk
• 7 blocks * 1.9 s = 13.3 seconds total read time
(a quick check of these numbers is sketched below)
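A quick back-of-the-envelope check of the slide's numbers, written as a small Scala snippet (not from the slides); it assumes decimal gigabytes (1 GB = 1000 MB), which is how the slide arrives at 1954 blocks.
// Assumed values from the slide: 500 GB total, 256 MB per block, 1.9 s to read one block
val blocks          = math.ceil(500.0 * 1000 / 256)   // ~1954 blocks
val singleNodeHours = blocks * 1.9 / 3600              // ~1.03 h sequential read on one node
val blocksPerDisk   = math.ceil(blocks / (40 * 8))     // 40 nodes x 8 disks: 6 to 7 blocks per disk
val clusterSeconds  = blocksPerDisk * 1.9              // ~13.3 s when all 320 disks read in parallel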
9. Data Flow Engine to Abstract Data Processing
• Provides a programming interface
• Jobs are expressed as graphs of high-level operators
• The system decides how to split each operator into tasks
• and where to run each task
• Takes care of cross-cutting concerns such as
• Concurrency
• Fault recovery
(a small example of such an operator graph is sketched below)
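A minimal Spark sketch of such an operator graph (illustrative, not from the slides; it assumes the spark-shell, where sc already exists, and a hypothetical log file and format).
// Each transformation adds an operator to the job graph; the engine decides how to
// split operators into tasks and where to run them. Nothing runs until the action.
val errorsPerHost = sc.textFile("hdfs:///logs/app.log")   // hypothetical input
  .filter(_.contains("ERROR"))                            // operator: filter
  .map(line => (line.split(" ")(0), 1))                   // operator: map to (host, 1); assumed log format
  .reduceByKey(_ + _)                                     // operator: shuffle + aggregate

errorsPerHost.collect().foreach(println)                  // action: triggers execution of the graph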
15. Stream processor / Kappa Architecture
[Diagram] Sources forward events immediately to a pub/sub bus (messaging system); the stream processor consumes them, processes at event time, updates the serving layer, and serves consumers.
20. Features of Data Processing Engines
• Processing Mode: Batch, Streaming, Hybrid
• Category: DC/SEP/ESP/CEP
• Delivery guarantees: at least once/exactly once
• State management: distributed snapshots/checkpoints
• Out-of-order processing: yes/no
• Windowing: time-based, count-based
• Latency: low or medium
23. Differentiation from MapReduce
• MapReduce was designed to process shared-nothing data
• Processing with data sharing:
• complex, multi-pass analytics (e.g. ML, graph)
• interactive ad-hoc queries
• real-time stream processing
• Improvements for coding:
• Less boilerplate code, richer API
• Support of various programming languages
24. Two Key Innovations of Spark
• Execution optimization via DAGs
• Distributed data containers (RDDs) to avoid serialization
[Diagram] One input, kept as a distributed dataset, serves several queries without being re-read.
(a short data-sharing sketch follows below)
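A minimal sketch of the data sharing that RDDs enable (illustrative, not from the slides; assumes the spark-shell and a hypothetical input path): the input is read and filtered once, cached, and then reused by several queries instead of being re-read for each one.
// Load and filter once, keep the distributed dataset in memory
val errors = sc.textFile("hdfs:///data/logs").filter(_.contains("ERROR")).cache()   // hypothetical input

// Several queries reuse the cached RDD instead of re-reading the source
val timeouts = errors.filter(_.contains("timeout")).count()
val dbIssues = errors.filter(_.contains("database")).count()
val sample   = errors.take(10)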
26. Start REPL locally, delegate execution via cluster manager
Execute on REPL
./bin/spark-shell --master local
./bin/pyspark --master yarn-client
Execute as application
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar 1000
Execute within Application
27. Components of a Spark application
Driver program
• SparkContext/SparkSession as hook into the execution environment
• Java, Scala, Python or R code (REPL or app)
• Creates a DAG of jobs
Cluster manager
• Grants executors to a Spark application
• Included: Standalone, YARN, Mesos or local
• Custom made: e.g. Cassandra
• Distributes jobs to executors
Executors
• Worker processes that execute tasks and store data
Web UI (driver, default port 4040)
• Supervise execution
31. SparkSession / SparkContext – the standard way to create containers
• SparkSession (starting from 2.0) as hook to the data
• SparkContext still available (accessible via sparkSession.sparkContext)
• Use the SparkSession to create Datasets
• Use the SparkContext to create RDDs
• A session object knows about the execution environment
• Can be used to load data into a container
(a short sketch follows below)
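A minimal sketch of the two entry points (illustrative, not from the slides; assumes Spark 2.x in local mode and hypothetical input files).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EntryPoints")
  .master("local[*]")                                       // assumption: local execution
  .getOrCreate()

// Datasets/DataFrames are created through the SparkSession
val people = spark.read.json("file:///tmp/people.json")     // hypothetical file
people.printSchema()

// RDDs are created through the SparkContext wrapped by the session
val sc = spark.sparkContext
val lines = sc.textFile("file:///tmp/people.txt")           // hypothetical file
println(lines.count())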
32. Operations on Collections: Transformations and Actions
val lines = sc.textFile("hdfs:///data/shakespeare/input") // Transformation
val lineLengths = lines.map(s => s.length) // Transformation
val totalLength = lineLengths.reduce((a, b) => a + b) // Action
Transformations:
• Create a new distributed dataset from a source or from another dataset
• Transformations are only recorded, not executed, until an action runs (lazy evaluation)
Actions:
• Trigger execution
• The engine then computes an optimized execution path for the stacked transformations
34. Common Spark Actions
collect - gather results from nodes and return
first - return the first element of the RDD
take(N) - return the first N elements of the RDD
saveAsTextFile - write the RDD as a text file
saveAsSequenceFile - write the RDD as a SequenceFile
count - count elements in the RDD
countByKey - count elements in the RDD by key
foreach - process each element of an RDD
(e.g., rdd.collect.foreach(println) )
35. WordCount in Scala
val text = sc.textFile(source_file)
val words = text.flatMap( line => line.split("\\W+") )
val kv = words.map( word => (word.toLowerCase(), 1) )
val totals = kv.reduceByKey( (v1, v2) => v1 + v2 )
totals.saveAsTextFile(output)
38. How to use SQL on Spark
• Spark SQL: component that comes directly out of the Berkeley (AMPLab) ecosystem
• Hive on Spark: use Spark as the execution engine for Hive
• BlinkDB: Approximate SQL Engine
39. Spark SQL
Spark SQL uses DataFrames (tabular, structured data containers) for SQL
Hive:
c = HiveContext(sc)
rows = c.sql("select * from titanic")
rows.filter(rows['age'] > 25).show()
JSON:
c.read.format('json').load('file:///root/tweets.json').registerTempTable("tweets")
c.sql("select text, user.name from tweets")
41. BlinkDB
• An approximate query engine for running interactive SQL queries.
• Allows trading off query accuracy for response time,
• enabling interactive queries over massive data by running them on data samples and presenting results annotated with meaningful error bars.
45. Typical Use Cases
Classification and regression
• Linear support vector machine
• Logistic regression
• Linear least squares, Lasso, ridge regression
• Decision tree
• Naive Bayes
Collaborative filtering
• Alternating least squares
Clustering
• K-means
Dimensionality reduction
• Singular value decomposition
• Principal component analysis
Optimization
• Stochastic gradient descent
• Limited-memory BFGS
http://spark.apache.org/docs/latest/mllib-guide.html
46. MLlib and H2O
• Databricks/Spark ML libraries: inspired by the scikit-learn library.
• MLlib works with RDDs
• ML works with DataFrames
• H2O library: built by the company H2O.
• H2O can be integrated with Spark via the 'Sparkling Water' connector.
(a short sketch contrasting the two Spark APIs follows below)
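A minimal sketch (illustrative, not from the slides; assumes a Spark 2.x shell where both sc and spark exist) contrasting the RDD-based spark.mllib API with the DataFrame-based spark.ml API, using K-means on a hypothetical two-column CSV of points.
// RDD-based API (org.apache.spark.mllib): works on RDDs of vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///data/points.csv")                        // hypothetical "x,y" lines
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
val mllibModel = KMeans.train(points, 3, 20)                               // k = 3, 20 iterations

// DataFrame-based API (org.apache.spark.ml): works on DataFrames with a features column
import org.apache.spark.ml.clustering.{KMeans => MLKMeans}
import org.apache.spark.ml.feature.VectorAssembler

val df = spark.read.option("inferSchema", "true").csv("hdfs:///data/points.csv").toDF("x", "y")
val features = new VectorAssembler()
  .setInputCols(Array("x", "y")).setOutputCol("features")
  .transform(df)
val mlModel = new MLKMeans().setK(3).setFeaturesCol("features").fit(features)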
47. Graph Analytics
Graph engine that analyzes tabular data
• Nodes: people and things (nouns/keys)
• Edges: relationships between nodes
Algorithms
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
One framework per container API
• GraphX is designed for RDDs
• GraphFrames for DataFrames
(a short GraphX sketch follows below)
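A minimal GraphX sketch (illustrative, not from the slides): a tiny graph is built from vertex and edge RDDs and PageRank is run on it; the people and relationships are made up for the example.
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Nodes: people and things; edges: relationships between them (toy data)
val vertices = sc.parallelize(Seq[(VertexId, String)](
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(vertices, edges)

// Run PageRank until convergence and print the rank per person
val ranks = graph.pageRank(0.0001).vertices
ranks.join(vertices).collect().foreach {
  case (_, (rank, name)) => println(f"$name%-6s $rank%.3f")
}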
49. Streaming: continuous processing on data that is continuously produced
[Diagram] Sources -> message broker -> stream processor: collect, publish/subscribe, analyse, serve & store
56. Building windows from a stream
“Number of visitors in the last 5 minutes per country”
[Diagram] Source -> Kafka topic -> Stream processor
// create stream from Kafka source
DataStream<LogEvent> stream = env.addSource(new KafkaConsumer());
// group by country
KeyedStream<LogEvent, Tuple> keyedStream = stream.keyBy("country");
// window of size 5 minutes
keyedStream.timeWindow(Time.minutes(5))
// do operations per window
    .apply(new CountPerWindowFunction());
59. Window types in Flink
• Tumbling windows
• Sliding windows
• Custom windows with window assigners, triggers and evictors
Further reading: http://flink.apache.org/news/2015/12/04/Introducing-windows.html
(a short sketch of tumbling vs. sliding windows follows below)
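A minimal sketch of tumbling vs. sliding time windows (Flink 1.x Scala DataStream API, illustrative, not from the slides; the LogEvent type and the sample elements are made up).
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class LogEvent(country: String, visitors: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream = env.fromElements(LogEvent("AT", 1L), LogEvent("DE", 2L), LogEvent("AT", 3L))

// Tumbling window: one result per non-overlapping 5-minute interval
stream.keyBy(_.country).timeWindow(Time.minutes(5)).sum("visitors").print()

// Sliding window: 10-minute windows, evaluated every 5 minutes (windows overlap)
stream.keyBy(_.country).timeWindow(Time.minutes(10), Time.minutes(5)).sum("visitors").print()

env.execute("window-types-sketch")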
60. Event Time vs. Processing Time
[Diagram] The Star Wars saga as an example: on the processing-time axis the episodes appear in release order (Episode IV 1977, V 1980, VI 1983, I 1999, II 2002, III 2005, VII 2015), while on the event-time axis they appear in story order (Episodes I through VII).
62. Batch vs. Continuous
Batch Jobs
• No state across batches
• Fault tolerance within a job
• Re-processing starts empty
Continuous Programs
• Continuous state across time
• Fault tolerance guards state
• Re-processing starts stateful
64. Re-processing data (continuous)
• Draw savepoints at times that you will want to start new jobs from (daily, hourly, …)
• Reprocess by starting a new job from a savepoint
• Defines start position in stream (for example Kafka offsets)
• Initializes pending state (like partial sessions)
[Diagram] Take a savepoint, then run the new streaming program from that savepoint.
65. Stream processor: Flink
Managed state in Flink
• Flink automatically backs up and restores state
• State can be larger than the available memory
• State back ends: (embedded) RocksDB, heap memory
[Diagram] A Kafka source feeds an operator with windows (large state); the local state backend is periodically backed up to, and recovered from, a distributed file system.
(a short configuration sketch follows below)
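A minimal configuration sketch (Flink 1.x Scala API, illustrative, not from the slides; assumes the flink-statebackend-rocksdb dependency and a hypothetical HDFS path for backups).
import org.apache.flink.streaming.api.scala._
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Back up operator state periodically (every 10 seconds)
env.enableCheckpointing(10000)

// Keep state locally in embedded RocksDB and snapshot it to a distributed file system
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))   // hypothetical path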
67. Fault tolerance in streaming
• How do we ensure the results are always correct?
• Failures should not lead to data loss or incorrect results
[Diagram] Source -> Kafka topic -> Stream processor
68. Fault tolerance in streaming
• At least once: ensure all events are transmitted
• May lead to duplicates
• At most once: ensure that a known state of data is transmitted
• May lead to data loss
• Exactly once: ensure that operators do not perform duplicate updates to their state
• Flink achieves exactly once with distributed snapshots (a minimal configuration sketch follows below)
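A minimal sketch of choosing the delivery guarantee (Flink 1.x Scala API, illustrative, not from the slides): checkpointing, i.e. distributed snapshots, is enabled and the mode is set explicitly.
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Take a distributed snapshot every 5 seconds
env.enableCheckpointing(5000)

// Exactly-once is the default; at-least-once trades result correctness under failure for lower latency
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE)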
70. Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over a 10 second window
• Emit window aggregates to Redis every second for query
Full Yahoo! article: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
“Storm […] and Flink […] show sub-second latencies at relatively high throughputs with Storm having the lowest 99th percentile latency. Spark streaming 1.5.1 supports high throughputs, but at a relatively higher latency.”
(Quote from the blog post’s executive summary)
71. Windowing with state in Redis
• Original use case did not use Flink’s windowing implementation.
• Data Artisans implemented the use case with Flink windowing.
[Diagram] KafkaConsumer -> map() -> filter() -> group -> Flink event-time windows -> real-time queries
73. Can we even go further?
[Diagram] Original pipeline: KafkaConsumer -> map() -> filter() -> group -> Flink event-time windows. The 1 GigE network link to the Kafka cluster is the bottleneck.
Solution: move the data generator into the job (10 GigE): Data Generator -> map() -> filter() -> group -> Flink event-time windows.
75. Survival of the Fastest – Flink Performance
• throughput of 15 million messages/second on 10 machines
• 35x higher throughput compared to Storm (80x compared to Yahoo’s runs)
• exactly once guarantees
• Read the full report: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/