This document provides an agenda and slides for a presentation on introducing big data concepts using open source tools. The presentation covers ingesting and analyzing sample data using Spark SQL, including joining datasets to count the number of books by author. It also demonstrates basic machine learning by loading sample revenue data, applying data quality rules to correct anomalies, and using linear regression to predict revenue for a party of 40 guests. The goal is to make big data concepts accessible to audiences of all experience levels.
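The two demos summarized above can be sketched in plain Python (the session itself uses Spark SQL; the author names, book titles, and revenue figures below are invented for illustration):

```python
# A plain-Python sketch of the two demos: a join-and-count, then a simple
# linear regression. Spark SQL would express the first as a join followed
# by groupBy().count(); the shape of the computation is the same.
from collections import Counter

# Join books to authors on author_id, then count books per author.
authors = {1: "Jane Austen", 2: "Mark Twain"}
books = [
    {"title": "Emma", "author_id": 1},
    {"title": "Persuasion", "author_id": 1},
    {"title": "Tom Sawyer", "author_id": 2},
]
books_per_author = Counter(authors[b["author_id"]] for b in books)
print(books_per_author["Jane Austen"])  # 2

# Least-squares linear regression to predict revenue for a party of
# 40 guests from historical (guests, revenue) pairs.
history = [(10, 500.0), (20, 1000.0), (30, 1500.0)]
n = len(history)
mean_x = sum(x for x, _ in history) / n
mean_y = sum(y for _, y in history) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in history) / \
        sum((x - mean_x) ** 2 for x, _ in history)
intercept = mean_y - slope * mean_x
print(slope * 40 + intercept)  # 2000.0 for this perfectly linear sample
```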
"Big Data made easy with a Spark" is the presentation I gave for ATO (AllThingsOpen) 2018.
In this hands-on session, you will learn how to run a full Big Data scenario from ingestion to publication. You will see how we can use Java and Apache Spark to ingest data, perform transformations, and save the data. You will then work through a second lab where you will run your very first Machine Learning algorithm!
A Backpack to go the Extra-Functional Mile (a hitched hike by the PROWESS pro... - Laura M. Castro
Property-based testing is a well-known testing methodology in the Erlang community, with tools such as QuickCheck and PropEr being highly popular among Erlang developers in the last few years. However, they are commonly used for functional testing... What are the challenges in using them for testing non-functional properties of software? What other tools or libraries are there to help Erlang developers?
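For readers unfamiliar with the methodology, here is a hand-rolled sketch of the property-based idea (QuickCheck and PropEr are Erlang tools; this plain-Python loop only shows the shape: generate many random inputs, check that an invariant holds for all of them):

```python
# Property-based testing in miniature: instead of hand-picked examples,
# generate random inputs and assert a property of the function under test.
import random

def encode(s: str) -> str:   # toy function under test (string reversal)
    return s[::-1]

def decode(s: str) -> str:
    return s[::-1]

random.seed(0)
for _ in range(200):
    # generator: random printable-ASCII strings of random length
    s = "".join(chr(random.randint(32, 126)) for _ in range(random.randint(0, 20)))
    # property: decode is the inverse of encode (round-trip invariant)
    assert decode(encode(s)) == s
print("property held for 200 random inputs")
```

Real property-based tools add the crucial extra step of shrinking a failing input to a minimal counterexample, which this sketch omits.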
In graph we trust: Microservices, GraphQL and security challenges - Mohammed A. Imran
Microservices, RESTful and API-first architectures are all the rage these days, and rightfully so: they solve some of the challenges of modern application development. Microservices enable organisations to ship code to production faster by dividing big monolithic applications into smaller but specialised applications. Though they provide great benefits, they are difficult to debug and secure in complex environments (different API versions, multiple API calls, frontend/backend gaps, etc.). GraphQL provides a powerful way to solve some of these challenges, but with great power comes great responsibility. GraphQL reduces the attack surface drastically (thanks to LangSec), but there are still many things which can go wrong.
This talk will cover the risks associated with GraphQL, along with the challenges and solutions that help in implementing secure GraphQL-based APIs. We will start off with an introduction to GraphQL and its benefits. We will then discuss the difficulty of securing these applications and why traditional security scanners don’t work with them. Finally, we will cover solutions which help in securing these APIs by shifting left in the DevOps pipeline.
We will cover the following as part of this presentation:
GraphQL use cases and how unicorns use them
Benefits and security challenges with GraphQL
Authentication and Authorisation
Resource exhaustion
Backend complexities with microservices
Need for tweaking conventional DevSecOps tools for security assurance
Security solutions which work with GraphQL
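One of the topics listed above, resource exhaustion, can be sketched concretely: reject overly nested queries before executing them. The depth check below just counts brace nesting in the raw query text; a real implementation would walk the parsed AST, and the limit of 5 is an arbitrary illustrative choice:

```python
# Minimal sketch of a GraphQL query-depth limit, a common defence against
# resource-exhaustion attacks via deeply nested queries.
def query_depth(query: str) -> int:
    depth = max_depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth

MAX_DEPTH = 5  # illustrative policy limit
deep_query = "{ a { b { c { d { e { f } } } } } }"
print(query_depth(deep_query))               # 6
print(query_depth(deep_query) <= MAX_DEPTH)  # False -> reject the query
```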
4Developers 2015: Lessons for Erlang VM - Michał Ślaski - PROIDEA
Speaker: Michał Ślaski
Language: English
YouTube: https://www.youtube.com/watch?v=mCIFPAG1u0k&list=PLnKL6-WWWE_WNYmP_P5x2SfzJ7jeJNzfp&index=56
Goal
Share experiences from using Erlang/OTP in real projects and discuss the Erlang VM features which help maintain and troubleshoot always-running systems.
Details
In this talk we will discuss thinking in a concurrent programming language along with the problems it has to solve.
We will take a look at the real tools Erlang gives you to write highly-available systems that let you sleep through the night: dynamic tracing and remote REPL.
We'll see how these tools are used.
Prerequisites
Attendees should be familiar with functional programming, actor-based concurrency, pattern matching, dynamic typing, and TCP/IP.
4Developers: http://4developers.org.pl/pl/
Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark... - Databricks
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products, or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs’ results to production), and that it’s important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.
Figuring out when things have gone terribly wrong is trickier than it first appears, since we want to catch the errors before our users notice them (or failing that before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and libraries that can assist us in writing relative validation rules based on historical data.
For folks working in streaming, we will talk about the unique challenges of attempting to validate in a real-time system, and what we can do besides keeping an up-to-date resume on file for when things go wrong. To keep the talk interesting, real-world examples (with company names removed) will be presented, as well as several Creative Commons-licensed cat pictures and an adorable panda GIF.
If you’ve seen Holden’s previous talks on testing Spark, this can be viewed as a deep dive on the second half, focused on what else we need to do besides good testing practices to create production-quality pipelines. If you haven’t seen the testing talks, watch those on YouTube after you come see this one.
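The "relative validation rules based on historical data" mentioned above can be sketched very simply: compare a job's output against recent history instead of a fixed threshold. The function name, counts, and 25% tolerance below are all invented for illustration; the libraries the talk surveys wrap this same idea:

```python
# A relative validation rule: accept a job's output metric only if it is
# within a tolerance band around the historical mean, and halt the
# deployment otherwise.
def within_relative_bounds(value, history, tolerance=0.25):
    """Accept value if it is within +/- tolerance of the historical mean."""
    mean = sum(history) / len(history)
    return abs(value - mean) <= tolerance * mean

row_counts = [980_000, 1_020_000, 1_005_000]          # last three runs
print(within_relative_bounds(1_010_000, row_counts))  # True: safe to deploy
print(within_relative_bounds(250_000, row_counts))    # False: halt the pipeline
```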
Michael Choi's process for designing web applications, including which programming language to use, when to use Node.js, when to use a lightweight framework vs. a heavy MVC framework, how to set up Git for collaboration based on the complexity of the project, how a tool like Jenkins can be used for continuous integration, continuous delivery, and continuous deployment, where to host the data, and what services to use for orchestrating containers or servers.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017 - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like:
Web Services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools: what should you monitor about the actual data that flows through the system?
And we’ll cover the simplest solution, built with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
A short introduction to more advanced Python and to programming in general. Intended for users who have already learned basic coding skills but want a rapid tour of the more in-depth capabilities offered by Python, along with some general programming background.
Exercises are available at: https://github.com/chiffa/Intermediate_Python_programming
De-mystifying contributing to PostgreSQL - Lætitia Avrot
PostgreSQL has a great community. They are open-minded, friendly, agreeable, and so on. You feel like helping them.
The problem is that you are shy and you look at community people as gods. On top of that, you don't want to mess up their work or bother them with questions that are obvious and silly (to them)!
This conference talk is based on my own true story. I will tell you about how I submitted my very first patch to the community. After some background presentation about how the community works, I will try to answer the following questions:
What can I do to help (and you'll see that even without coding you can do a lot!)?
What's a contribution?
What's a patch? How can I create one?
And I hope that sooner or later you'll come and join the community and you'll feel so proud of yourselves!
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D... - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like:
Web Services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools: what should you monitor about the actual data that flows through the system?
And we’ll cover the simplest solution, built with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
OSDC 2019 | Terraform best practices with examples and arguments by Anton Bab... - NETWAYS
This talk is for developers who want to learn best practices for using Terraform at companies and projects of various sizes (from small to very large), with pros and cons of code structuring, compositions, and tools. Attendees will also be able to learn Terraform (and Terragrunt) tricks and gotchas.
Open Source Search Tools for the www2010 conference - Ted Drake
Presentation by Ted Drake and Rosie Jones for the www2010 conference in North Carolina. It discusses open source search software, APIs, and trends.
Intro - End to end ML with Kubeflow @ SignalConf 2018 - Holden Karau
There are many great tools for training machine learning models, ranging from scikit-learn to Apache Spark and TensorFlow. However, many of these systems largely leave open the question of how to use our models outside of the batch world (for example, in a reactive application). Different options exist for persisting the results and using them for live training, and we will explore the trade-offs of the different formats and their corresponding serving/prediction layers.
Apache Toree provides an interactive notebook for Spark/Scala. Toree is an IPython/Jupyter kernel. It lets you mix Spark/Scala code with Markdown, execute the notebook, and publish it on the web.
Asim will talk about how to install and get started with Apache Toree, how to use it to develop Spark applications interactively in notebooks, and how to publish your notebooks.
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion... - Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem in itself. In the talk we’ll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like Web Services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools: what should you monitor about the actual data that flows through the system? We’ll cover the simplest solution, built with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
I strongly believe in the combination of Apache Spark with Java. In this tutorial, prepared for NCDevCon, we go through the basics of Spark as well as two examples: a basic ingestion and an analytics example based on joins and group by. Follow me @jgperrin.
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion... - Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem in itself. In the talk, we’ll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like Web Services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools: what should you monitor about the actual data that flows through the system? We’ll cover the simplest solution, built with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion... - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like:
Web Services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools: what should you monitor about the actual data that flows through the system?
And we’ll cover the simplest solution, built with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene - Databricks
Organizations are developing deep learning applications to derive new insights, identify new opportunities, and uncover new efficiencies. However, deep learning application development often means tapping into multiple frameworks, libraries, and clusters—a complex, time-consuming, and costly effort. This keynote will discuss what the newly released BigDL (an open source distributed deep learning framework for Apache Spark and Intel® Xeon® clusters) can offer to developers, and what solutions Intel has enabled for customers and partners. In addition, plans for expanding the BigDL ecosystem will also be highlighted.
Data Engineer's Lunch 90: Migrating SQL Data with Arcion - Anant Corporation
In Data Engineer's Lunch 90, Eric Ramseur teaches our audience how to use Arcion.
From best practices to real-world examples, this talk will provide you with the knowledge and insights you need to ensure a successful migration of your SQL data. So whether you're new to data migration or looking to improve your existing process, join us and discover how Arcion can help you achieve your goals.
As data science workloads grow, so does their need for infrastructure. But, is it fair to ask data scientists to also become infrastructure experts? If not the data scientists, then, who is responsible for spinning up and managing data science infrastructure? This talk will address the context in which ML infrastructure is emerging, walk through two examples of ML infrastructure tools for launching hyperparameter optimization jobs, and end with some thoughts for building better tools in the future.
Originally given as a talk at the PyData Ann Arbor meetup (https://www.meetup.com/PyData-Ann-Arbor/events/260380989/)
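The hyperparameter-optimization jobs mentioned above can be sketched in miniature as a random search over a small space against a toy objective (the search space, parameter names, and objective below are invented for illustration; real ML infrastructure would run each trial on a separate worker):

```python
# Random-search hyperparameter optimization in miniature: sample trial
# configurations, score each with an objective, keep the best.
import random

def objective(lr: float, depth: int) -> float:
    # toy validation loss, minimized around lr=0.1, depth=5
    return (lr - 0.1) ** 2 + 0.01 * (depth - 5) ** 2

random.seed(42)
best = None
for _ in range(50):
    trial = {"lr": random.uniform(0.001, 1.0), "depth": random.randint(1, 10)}
    loss = objective(trial["lr"], trial["depth"])
    if best is None or loss < best[0]:
        best = (loss, trial)

print(best[1])  # the hyperparameters with the lowest toy loss found
```

The infrastructure question the talk raises is exactly what sits around this loop: who provisions the workers that evaluate each trial, and how results get collected.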
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark - Databricks
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
Monitoring Big Data Systems - "The Simple Way" - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like:
Web Services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools: what should you monitor about the actual data that flows through the system?
And we’ll cover the simplest solution, built with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Demi Ben-Ari is a Co-Founder and CTO @ Panorays.
Demi has over 9 years of experience building various systems, from near-real-time applications to Big Data distributed systems.
He describes himself as a software development groupie, interested in tackling cutting-edge technologies.
Demi is also a co-founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
How can .NET contribute to Data Science? What is .NET Interactive? What do notebooks have to do with it? And Apache Spark? And the Python world? And Azure? In this session, let's put these ideas in order.
Project Flogo: An Event-Driven Stack for the Enterprise - Leon Stigter
In today's world everyone is building apps, most times those apps are event-driven and react to what happens around them. How do you take those apps to, let's say, a Kubernetes cluster, or let them communicate between cloud and on-premises, and how can developers and non-developers work together using the same tools?
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
To defend against attacks, think like a hacker. But does that mean you need to be a DevOps expert? Security researchers today need to discover new attack techniques; however, much of their focus is diverted to backend coding. We share how to build an infrastructure for researchers that allows them to concentrate on business logic and on writing hacker “tasks”. Using Docker and Kubernetes on Google Cloud, these tasks can then be performed in parallel and without a lot of DevOps hassle. Our technique removes two common barriers: first, long and risky deployment processes, and second, low transparency within the production system.
We promise to share the stupid things too.
Big Graph Analytics on Neo4j with Apache Spark - Kenny Bastani
In this talk I will introduce you to a Docker container that provides an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and subsequently updated with the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Database technologies have evolved to be able to store big data, but remain largely inflexible. For complex graph data models stored in a relational database, tedious transformations and shuffling of data may be required to perform large-scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
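One of the analyses named above, PageRank, can be sketched in plain Python on a tiny adjacency list (GraphX distributes this same iteration across a cluster; the three-node graph here is made up):

```python
# Iterative PageRank on an adjacency list: each node distributes a damped
# share of its rank along its out-edges; a (1 - damping) teleport term is
# spread uniformly.
def pagerank(graph, damping=0.85, iters=20):
    ranks = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iters):
        new = {node: (1 - damping) / len(graph) for node in graph}
        for node, out in graph.items():
            share = damping * ranks[node] / len(out) if out else 0.0
            for dst in out:
                new[dst] += share
        ranks = new
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c": the most linked-to node
```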
An introduction to data engineering & data science using Apache Spark and Java.
Get Spark in Action 2e, at http://jgp.ai/sia.
In this presentation, I start by loading a few CSV files in Spark (ingestion) and displaying them with the help of this new tool I built, dṛṣṭi.
As you would expect, I clean the data, join it, transform it, and continue to visualize it through dṛṣṭi.
I use Delta Lake to create a cache for my data, explain what imputation is, and show how I can use imputation on my datasets to add the missing datapoints.
I then use Spark on simple linear regressions to predict/forecast data.
dṛṣṭi is open source (Apache 2 license) and is available at: https://github.com/jgperrin/ai.jgp.drsti.
All the labs are available at https://github.com/jgperrin/ai.jgp.drsti-spark.
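The two core steps described above, imputation followed by a simple linear regression forecast, can be sketched in plain Python (the labs themselves use Spark and Delta Lake; the time series below is invented):

```python
# Step 1: imputation -- fill an interior missing datapoint by linear
# interpolation between its neighbours.
series = [10.0, 12.0, None, 16.0, 18.0]   # None marks the missing value
filled = series[:]
for i, v in enumerate(filled):
    if v is None:
        filled[i] = (filled[i - 1] + filled[i + 1]) / 2
print(filled)  # [10.0, 12.0, 14.0, 16.0, 18.0]

# Step 2: least-squares linear regression over (index, value), then
# forecast the next point in the series.
xs = list(range(len(filled)))
n = len(xs)
mx, my = sum(xs) / n, sum(filled) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, filled)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
print(slope * n + intercept)  # 20.0: the forecast for index 5
```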
Apache Spark v3 is a new milestone for the Big Data framework. In this session, you will (re)discover what Spark is, learn about the new features in its third major version, and go through a complete end-to-end project.
I like to call Spark an Analytics Operating System: it offers far more than just a framework or a library, and I will explain why. Spark v3 is the latest major evolution. It was released in mid-June 2020 and adds impressive new features. After looking at them from a high level, I will detail a few of my favorites.
Finally, as we all like code (well, at least I do), I will demonstrate a complete data & AI pipeline looking at Covid-19 data.
Key takeaways: Spark as an Analytics OS, Spark v3 highlights, building data/AI pipelines/models with Spark.
Audience: software engineers, data engineers, architects, data scientists.
These slides were used for NC Tech's lunch and learn on Aug. 22, 2018.
In this lunch and learn, hosted by Veracity Solutions, you will learn how Spark can help your business build a pragmatic technology roadmap to AI (Artificial Intelligence), Machine Learning, and Big Data analytics. Apache Spark is a wonderful platform for distributed data processing and analytics, but how is it used by different organizations? How difficult is it to on-board a team? What technology do they need to master before on-boarding? Do they have to master Scala, or can they simply use their Java skills? You will find answers to those questions, get a realistic perspective on the platform, and see code (because we are all a bit geeky, right?).
Full link to the event: https://www.nctech.org/events/event/2018/lunch-and-learn-august22.html.
Spark Summit Europe Wrap Up and TASM State of the CommunityJean-Georges Perrin
On 12/12, we held our Spark meetup, called Winter 3x30, at IBM. These are the slides I used both to introduce the state of our community, TASM (Triangle Apache Spark Meetup), and to wrap up Spark Summit Europe.
As I went to Spark Summit in San Francisco, early June, I wanted to share key takeaways from the conference with my local friends of the Triangle Apache Spark Meetup.
Used for teaching HTML to middle school children (6th, 7th, and 8th graders) in a game-like way, with some immediate gratification. Feedback much appreciated: jgp@jgp.net.
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...Jean-Georges Perrin
On July 9th 2015, 2CRSI announced its latest storage system: the 2U24NVMe, which features 24 NVMe SSD drives, each 10 to 12 times faster than SATA/SAS SSDs. Jean-Georges Perrin, 2CRSI Corporation's COO, introduces you to this wonderful solution... and more. This presentation was first given on July 13th 2015 at the ISC HPC conference in Frankfurt, Germany.
A strategic vision for using (Open)Data in the enterpriseJean-Georges Perrin
A vision for an OpenData usage strategy, with a definition, the ecosystem, the obstacles, and possible solutions to lift those obstacles.
Proposal to create a consortium of private and public actors.
Presented by Jean-Georges Perrin, GreenIvory (http://greenivory.fr/), as part of a Rhenatic workshop (http://www.rhenatic.eu/).
Presentation done for the AdriaUG on May 23rd 2012 in Zagreb, Croatia.
This is an updated version of the presentation done in 2010 at the IIUG conference in Overland Park, KS, USA.
Version of the presentation used for the DCF (Dirigeants Commerciaux de France) on January 9th 2012, near Colmar, Alsace.
Adapted from the presentation given at the CCI Alsace in Strasbourg in October 2011.
Conference given at the CCI de Strasbourg on October 11th 2011, illustrating how to make better use of your website in order to sell better.
The examples are real projects built with GreenIvory's technologies.
Discover GreenIvory:
http://greenivory.fr/
Discover our success stories:
http://greenivory.fr/success-stories.html
Discovering the new trends of the web (Mulhouse Edition)Jean-Georges Perrin
Conference by Jean-Georges Perrin (GreenIvory) at the CCI SAM (Sud Alsace - Mulhouse), organized by Martine Zussy.
Topics covered: social web, search engine optimization (SEO), SMO...
MashupXFeed and editorial strategy - Workshop Activis - GreenIvoryJean-Georges Perrin
Presentation by Jean-Georges Perrin (CEO of GreenIvory) on setting up an editorial strategy, with other examples of how MashupXFeed can be used. Details on content farms.
MashupXFeed and SEO - Workshop Activis - GreenivoryJean-Georges Perrin
Presentation by Xavier-Noël Cullmann (sales engineer at Activis) on the benefits of MashupXFeed when used for search engine optimization. Focus on duplicate content.
Slides used during the conference held on April 11th 2011 in Illkirch, for the debate on Web 2.0 trends organized by GreenIvory and ENSIIE at the ISU.
This event gathered more than 120 people. To keep the conversation going: http://blog.greenivory.fr/2011/04/13/retour-sur-la-conference-a-la-decouverte-des-nouvelles-tendances-du-web/.
#w2e #w2esxb
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots, and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, driven by institutional investment rotating out of offices and into work from home (“WFH”), alongside an ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as maturing cloud services and edge sites, with the industry expected to see strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, illustrated by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow to over 3.6x their current value by 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
3. News…
๏ Director of Engineering for WeExperience
๏ Hiring a team of talented engineers to work with us
๏ Front end
๏ Mobile
๏ Back end & data
๏ AI
๏ Shoot at @jgperrin
9. Agenda
๏ What is Big Data?
๏ What is Spark?
๏ What can I do with Spark?
๏ What is a Spark app, anyway?
๏ Install a bunch of software
๏ A first example
๏ Understand what just happened
๏ Another example, slightly more complex, because you are now ready
๏ But now, sincerely, what just happened?
๏ Let’s do AI!
๏ Going further
10. The 3, 4, or 5 Vs of Biiiiiiiig Data
๏ volume
๏ variety
๏ velocity
๏ variability
๏ value
Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
11. Data is considered big when it needs more than one computer to be processed
14. An analytics operating system?
[Diagram: several machines, each with its hardware and OS; a distributed OS spans them; the Analytics OS sits on top of the distributed OS, and the apps run on the Analytics OS.]
16. Some use cases
๏ NCEatery.com
๏ Restaurant analytics
๏ 1.57×10^21 datapoints analyzed
๏ Lumeris
๏ General compute
๏ Distributed data transfer/pipeline
๏ CERN
๏ Analysis of the scientific experiments in the LHC (Large Hadron Collider)
๏ IBM
๏ Watson Data Studio
๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
๏ And much more…
17. What does a typical app look like?
๏ Connect to the cluster
๏ Load data
๏ Do something with the data
๏ Share the results
20. Get all the S T U F F
๏ Go to http://jgp.net/ato2018
๏ Install the software
๏ Access the source code
21. Download some tools
๏ Java JDK 1.8
๏ http://bit.ly/javadk8
๏ Eclipse Oxygen or later
๏ http://bit.ly/eclipseo2
๏ Other nice to have
๏ Maven
๏ SourceTree or git (command line)
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
http://www.eclipse.org/downloads/eclipse-packages/
24. Lab #1 - ingestion
๏ Goal
In a Big Data project, ingestion is the first operation.
You get the data “in.”
๏ Source code
https://github.com/jgperrin/net.jgp.books.spark.ch01
25. Getting deeper
๏ Go to net.jgp.books.spark.ch01
๏ Open CsvToDataframeApp.java
๏ Right click, Run As, Java Application
26. +---+--------+--------------------+-----------+--------------------+
| id|authorId| title|releaseDate| link|
+---+--------+--------------------+-----------+--------------------+
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
+---+--------+--------------------+-----------+--------------------+
only showing top 5 rows
27. package net.jgp.books.sparkWithJava.ch01;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDataframeApp {

  public static void main(String[] args) {
    CsvToDataframeApp app = new CsvToDataframeApp();
    app.start();
  }

  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    // Reads a CSV file with header, called books.csv, stores it in a dataframe
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // Shows at most 5 rows from the dataframe
    df.show(5);
  }
}
/jgperrin/net.jgp.books.sparkWithJava.ch01
30. [Diagram: a cluster of eight nodes, each shown as hardware plus its OS; on top, a unified API exposes Spark SQL, Spark streaming, machine learning (& deep learning & artificial intelligence), and GraphX to your application.]
31. [Diagram: the same cluster seen through Spark: your application works with a dataframe through the unified API (Spark SQL, Spark streaming, machine learning & deep learning & artificial intelligence, GraphX); the eight nodes' OSs sit underneath and are abstracted away.]
32. [Diagram: the dataframe is the common currency: Spark SQL, Spark streaming, machine learning & deep learning & artificial intelligence, and GraphX all operate on it.]
33. Lab #2 - a bit of analytics
But really just a bit
34. Lab #2 - a little bit of analytics
๏ Goal
From two datasets, one containing books and the other authors, list the authors with the most books, sorted by number of books
๏ Source code
https://github.com/jgperrin/net.jgp.labs.spark
35. If it was in a relational database
authors.csv: id (integer), name (string), link (string), wikipedia (string)
books.csv: id (integer), authorId (integer), title (string), releaseDate (string), link (string)
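The join-and-count that Lab #2 performs with dataframes can be sketched with plain Java collections. This is a toy sketch with made-up records and sample rows (not the lab's Spark code): look up each book's author name, then count books per author.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AuthorBookCount {
    record Author(int id, String name) {}
    record Book(int id, int authorId, String title) {}

    // Mirrors the dataframe join: map authorId -> name, then group books by author name.
    static Map<String, Long> countBooksByAuthor(List<Author> authors, List<Book> books) {
        Map<Integer, String> names = authors.stream()
            .collect(Collectors.toMap(Author::id, Author::name));
        return books.stream()
            .collect(Collectors.groupingBy(b -> names.get(b.authorId()),
                                           Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Author> authors = List.of(
            new Author(1, "J. K. Rowling"), new Author(2, "J. G. Perrin"));
        List<Book> books = List.of(
            new Book(1, 1, "Fantastic Beasts"),
            new Book(2, 1, "Harry Potter"),
            new Book(5, 2, "Informix 12.10"));
        System.out.println(countBooksByAuthor(authors, books));
    }
}
```

On a cluster, Spark distributes the same join across nodes; the logic stays conceptually identical.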
36. Basic analytics
๏ Go to net.jgp.labs.spark.l200_join.l030_count_books
๏ Open AuthorsAndBooksCountBooksApp.java
๏ Right click, Run As, Java Application
44. Popular beliefs
General AI
๏ Robot with human-like behavior
๏ HAL from 2001
๏ Isaac Asimov
๏ Potential ethical problems
Narrow AI (the current state of the art)
๏ Lots of mathematics
๏ Heavy calculations
๏ Algorithms
๏ Self-driving cars
45. “I am an expert in general AI” / ARTIFICIAL INTELLIGENCE is Machine Learning
46. Machine learning
๏ Common algorithms
๏ Linear and logistic regressions
๏ Classification and regression trees
๏ K-nearest neighbors (KNN)
๏ Deep learning
๏ Subset of ML
๏ Artificial neural networks (ANNs)
๏ Super CPU intensive, use of GPU
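Of the algorithms listed above, k-nearest neighbors is the easiest to sketch in a few lines of plain Java. Here is a minimal 1-nearest-neighbor classifier with Euclidean distance; the points and labels are made up for illustration (this is not MLlib code):

```java
import java.util.List;

public class NearestNeighbor {
    record Point(double x, double y, String label) {}

    // 1-NN: classify a query point with the label of the closest training point.
    static String classify(List<Point> training, double qx, double qy) {
        Point best = null;
        double bestDist = Double.MAX_VALUE;
        for (Point p : training) {
            double d = Math.hypot(p.x() - qx, p.y() - qy); // Euclidean distance
            if (d < bestDist) {
                bestDist = d;
                best = p;
            }
        }
        return best.label();
    }

    public static void main(String[] args) {
        List<Point> training = List.of(
            new Point(1, 1, "small"), new Point(2, 1, "small"),
            new Point(8, 9, "large"), new Point(9, 8, "large"));
        System.out.println(classify(training, 1.5, 1.2)); // closest points are "small"
    }
}
```

MLlib offers distributed versions of most of the algorithms on this slide, so the same idea scales past one machine.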
47. There are two kinds of data scientists:
1) Those who can extrapolate from incomplete data.
48. Data Engineer vs. Data Scientist
Data Engineer: develops, builds, tests, and operationalizes datastores and large-scale processing systems. DataOps is the new DevOps. Matches architecture with business needs. Develops processes for data modeling, mining, and pipelines. Improves data reliability and quality.
Data Scientist: cleans, massages, and organizes data. Performs statistics and analysis to develop insights, build models, and search for innovative correlations. Prepares data for predictive models. Explores data to find hidden gems and patterns. Tells stories to key stakeholders.
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
49. [Diagram: the skill Data Engineers and Data Scientists share: SQL.]
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
52. Lab #3 - projecting data
๏ Goal
As a restaurant manager, I want to predict how
much revenue a party of 40 will bring
๏ Source code
https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
57. Using existing data quality rules
package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF1;
import net.jgp.labs.sparkdq4ml.dq.service.*;

public class MinimumPriceDataQualityUdf
    implements UDF1<Double, Double> {

  public Double call(Double price) throws Exception {
    return MinimumPriceDataQualityService.checkMinimumPrice(price);
  }
}
/jgperrin/net.jgp.labs.sparkdq4ml
If the price is OK, the price is returned; if it is not, -1 is returned
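The service behind the UDF is not shown on the slide. A minimal plain-Java sketch of what such a rule could look like follows; the threshold of 0 is an assumption (the real rule lives in MinimumPriceDataQualityService in the repository):

```java
public class MinimumPriceRule {
    // Hypothetical threshold: a price must be strictly positive to be valid.
    private static final double MINIMUM_PRICE = 0.0;

    // Returns the price when it passes the check, -1 when it does not
    // (the convention the slide describes).
    static double checkMinimumPrice(double price) {
        return price > MINIMUM_PRICE ? price : -1d;
    }

    public static void main(String[] args) {
        System.out.println(checkMinimumPrice(120.5)); // valid price passes through
        System.out.println(checkMinimumPrice(-3.0));  // anomaly flagged as -1
    }
}
```

Keeping the rule in a plain service class like this means it can be unit-tested without Spark, then wrapped in a UDF for the cluster.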
58. Telling Spark to use my DQ rules
SparkSession spark = SparkSession.builder()
.appName("DQ4ML").master("local").getOrCreate();
spark.udf().register(
"minimumPriceRule",
new MinimumPriceDataQualityUdf(),
DataTypes.DoubleType);
spark.udf().register(
"priceCorrelationRule",
new PriceCorrelationDataQualityUdf(),
DataTypes.DoubleType);
/jgperrin/net.jgp.labs.sparkdq4ml
59. Loading my dataset
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true").option("header", "false")
    .load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");
Using CSV, but it could be Hive, JDBC, you name it…
/jgperrin/net.jgp.labs.sparkdq4ml
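To make the two data quality steps concrete (apply the rule, then drop the flagged rows), here is the same logic as a plain-Java streams sketch. The Row record and the simplified rule are illustrative assumptions, not the lab's Spark code:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PriceCleanup {
    record Row(int guest, double price) {}

    // Step 1: flag non-positive prices as -1 (a simplified stand-in for the
    // minimum-price rule). Step 2: drop flagged rows, like the SQL statement does.
    static List<Row> clean(List<Row> rows) {
        return rows.stream()
            .map(r -> new Row(r.guest(), r.price() > 0 ? r.price() : -1d))
            .filter(r -> r.price() > 0)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> raw = List.of(
            new Row(10, 250.0), new Row(12, -5.0), new Row(40, 980.0));
        System.out.println(clean(raw)); // the bad row (12, -5.0) is gone
    }
}
```

The flag-then-filter pattern matters: flagging with -1 keeps the rule a pure column transformation (a UDF), and the filtering stays declarative in SQL.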
65. Format the data for ML
๏ Convert/adapt the dataset to features and label
๏ Required for linear regression in MLlib
๏ Needs a column called label, of type double
๏ Needs a column called features, of type VectorUDT
66. Format the data for ML
spark.udf().register(
"vectorBuilder",
new VectorBuilder(),
new VectorUDT());
df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));
// ... Lots of complex ML code goes here ...
double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);
/jgperrin/net.jgp.labs.sparkdq4ml
68. (the complex ML code)
LinearRegression lr = new LinearRegression()
.setMaxIter(40)
.setRegParam(1)
.setElasticNetParam(1);
LinearRegressionModel model = lr.fit(df);
Double feature = 40.0;
Vector features = Vectors.dense(40.0);
double p = model.predict(features);
/jgperrin/net.jgp.labs.sparkdq4ml
Define the algorithm and its (hyper)parameters
Create a model from our data
Apply the model to a new dataset: predict
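Under the hood, fitting y = a·x + b is ordinary least squares. Here is a dependency-free Java sketch on hypothetical (guests, revenue) points (the lab's real data lives in data/dataset.csv, and MLlib adds regularization via setRegParam/setElasticNetParam, omitted here):

```java
public class RevenueRegression {
    // Ordinary least squares for y = a*x + b, the same family of model
    // MLlib's LinearRegression fits on the (guest, price) data.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double a = (n * sxy - sx * sy) / (n * sxx - sx * sx); // slope
        double b = (sy - a * sx) / n;                         // intercept
        return new double[] { a, b };
    }

    public static void main(String[] args) {
        // Hypothetical (guests, revenue) points, exactly linear for clarity
        double[] guests = { 10, 20, 30, 50 };
        double[] revenue = { 250, 500, 750, 1250 };
        double[] model = fit(guests, revenue);
        double prediction = model[0] * 40 + model[1];
        System.out.println("Prediction for 40 guests is " + prediction); // 1000.0
    }
}
```

With real, noisy data the closed-form answer and MLlib's iterative solver will differ slightly; the point is only to show what "fit, then predict" computes.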
69. It’s all about the base model
[Diagram: step 1, the learning phase: a trainer builds a model from dataset #1. Steps 2..n, the predictive phase: the same model is applied to dataset #2 to produce predicted data.]
71. A (Big) Data Scenario
Raw data → Ingestion → Data quality → Pure data → Transformation → Rich data → Load/Publish
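The scenario above is essentially function composition: each stage consumes the previous stage's output. A minimal plain-Java sketch, where each stage is a Function and the stage bodies are placeholder transformations (not the labs' actual logic):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Pipeline {
    // Compose the stages in the order of the diagram:
    // raw data -> ingestion -> data quality -> transformation -> rich data.
    static List<String> run(List<String> raw) {
        Function<List<String>, List<String>> ingest =
            rows -> rows.stream().map(String::trim).collect(Collectors.toList());
        Function<List<String>, List<String>> dataQuality =
            rows -> rows.stream().filter(r -> !r.isEmpty()).collect(Collectors.toList());
        Function<List<String>, List<String>> transform =
            rows -> rows.stream().map(String::toUpperCase).collect(Collectors.toList());
        return ingest.andThen(dataQuality).andThen(transform).apply(raw);
    }

    public static void main(String[] args) {
        // "Rich data", ready to load/publish
        System.out.println(run(List.of(" alice ", "", "bob"))); // prints [ALICE, BOB]
    }
}
```

In the labs, each arrow of the diagram is a dataframe-to-dataframe step; Spark's lazy evaluation optimizes the whole chain before executing it.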
72. Key takeaways
๏ Big Data is easier than one might think
๏ Java is the way to go (or Python)
๏ New vocabulary for using Spark
๏ You have a friend to help (ok, me)
๏ Spark is fun
๏ Spark is easily extensible
73. Going further
๏ Contact me @jgperrin
๏ Join the Spark User mailing list
๏ Get help from Stack Overflow
๏ fb.com/TriangleSpark
๏ Start a Spark meetup in Columbia, SC?
74. Going further
Spark in action (Second edition, MEAP)
by Jean Georges Perrin
published by Manning
http://jgp.net/sia
Discount codes (40% off): sprkans-681D, sprkans-7538, ctwopen10119
One
two free books