Slides for Data Syndrome's one-hour course on PySpark. It introduces basic operations, Spark SQL, Spark MLlib, and exploratory data analysis with PySpark, and shows how to use pylab with Spark to create histograms.
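A minimal sketch of the kind of pylab histogram the course shows; the CSV file name and column name below are assumptions for illustration, not the course's actual dataset:

from pyspark.sql import SparkSession
import pylab

spark = SparkSession.builder.appName("pyspark-eda").getOrCreate()

# Hypothetical dataset and numeric column, for illustration only.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Collect the single column to the driver; fine for EDA-sized samples.
values = [row["value"] for row in df.select("value").collect()]

pylab.hist(values, bins=50)
pylab.show()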
Spark Summit EU 2015: Lessons from 300+ production users - Databricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
[FrontDays'2017] Leonid Blokhin (Big Data Engineer): Mist. A service for working with... - Provectus
This document discusses Mist, an open source platform for exposing Apache Spark jobs through REST APIs. Mist allows users to work with predictive services through a higher level of abstraction without needing to manage low-level Spark configurations. It supports Spark SQL, MLlib and GraphX jobs written in Scala or Python. Mist handles Spark context orchestration, provides real-time low latency model serving, and supports recovery of jobs after failures. The document demonstrates how to configure Mist and develop Spark jobs to run on the platform through REST or MQTT APIs. Future plans include adding support for Apache Kafka streaming and AMQP.
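A hedged sketch of what submitting a job through a Mist-style REST endpoint could look like; the host, port, path, and payload shape below are assumptions for illustration, not Mist's documented API:

import requests

# Hypothetical Mist host, endpoint path, and job parameters.
response = requests.post(
    "http://mist-host:2004/api/jobs/my-spark-job",
    json={"parameters": {"input": "hdfs:///data/events", "limit": 100}},
)
print(response.json())  # a result or a job id to poll, depending on setup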
This document summarizes Mike Krieger's talk on scaling Instagram from its early days with 2 engineers to supporting over 30 million users. Some key points include: starting simply with Django and PostgreSQL; adopting Redis for caching and queuing; implementing database sharding in PostgreSQL as user growth increased database size; focusing on simplicity, monitoring, and nimble iteration; and scaling components individually while maintaining a minimal overall architecture. Krieger emphasizes optimizing for operational simplicity and solving problems with existing tools before building custom solutions.
This document summarizes Mike Krieger's talk on scaling Instagram from its early days with 2 engineers to supporting over 30 million users. Some key points include: starting simply with Django and PostgreSQL; adopting Redis for caching and queuing; implementing database sharding in PostgreSQL as user growth increased database size; using a variety of tools like Nginx, HAProxy, Memcached and monitoring with Munin and StatsD; and focusing on simplicity, instrumentation, and nimble iteration to adapt as needs changed.
The document discusses the future of data science, including increased use of functional programming, cloud notebooks, and probabilistic modeling of large and diverse datasets from IoT devices, drones, and satellites. It also predicts data scientists will displace traditional product managers as data becomes more important for decision making. Overall, the future involves analyzing exponentially larger volumes of diverse data using scalable cloud tools and probabilistic algorithms.
I gave a series of Seminars at the following colleges in Solapur.
1. Walchand Institute of Technology, Solapur.
2. Brahmdevdada Mane Institute of Technology, Solapur.
3. Orchid College of Engineering & Technology, Solapur.
4. SVERI's College of Engineering, Pandharpur.
It focused on what 'Big Data' is and how the next generation of professionals should be ready for the Big Data revolution.
Introduction to NetGuardians' Big Data Software Stack - Jérôme Kehrli
NetGuardians runs its Big Data Analytics Platform on three key Big Data components underneath: Elasticsearch, Apache Mesos, and Apache Spark. This is a presentation of the behaviour of this software stack.
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR - MongoDB
Whether you're thinking of migrating to MongoDB or need to meet legal requirements for an existing on-prem cluster, this talk has you covered. We start with the basics of replication and sharding and quickly scale up, covering everything you need to know to control your data and keep it safe from unexpected data loss or downtime - a well-designed MongoDB cluster should have no single point of failure. Learn how others are “stretching” what’s possible but why you shouldn't! I'll present real-world examples from my life in the field in Europe and beyond.
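As a minimal sketch of the no-single-point-of-failure idea, here is how a client might connect to a three-member replica set with pymongo; the host names and replica set name are placeholders:

from pymongo import MongoClient

# Placeholder hosts; the driver fails over if any one member goes down.
client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0",
    w="majority",      # acknowledge writes only once a majority holds them
    retryWrites=True,  # retry transparently across a primary election
)
print(client.admin.command("ping"))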
Jumpstart on Apache Spark 2.2 on Databricks - Databricks
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of Spark 2.x’s new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
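A hedged sketch of that lab flow in PySpark; the file path and column names are assumptions, not the workshop's actual datasets:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("jumpstart-lab").getOrCreate()

# Explore a dataset and apply a simple transformation (the ETL step).
df = spark.read.json("/tmp/sample.json")  # placeholder path
cleaned = df.filter(F.col("value").isNotNull())

# Save as a Parquet file, then read it back from that source.
cleaned.write.mode("overwrite").parquet("/tmp/cleaned.parquet")
parquet_df = spark.read.parquet("/tmp/cleaned.parquet")

# Analyze with the DataFrame API and with Spark SQL over a temp view.
parquet_df.groupBy("category").count().show()
parquet_df.createOrReplaceTempView("cleaned")
spark.sql("SELECT category, COUNT(*) AS n FROM cleaned GROUP BY category").show()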
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with at least 8 GB of RAM and the Chrome or Firefox browser installed. Introductory or basic knowledge of Scala or Python is required, since the notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Jump Start on Apache® Spark™ 2.x with Databricks - Databricks
Apache Spark 2.0 and the subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and much new functionality. Its three main themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for structured data.
Peter Marshall, Technology Evangelist at Imply
Abstract: Apache Druid® can revolutionise business decision-making with a view of the freshest of fresh data in web, mobile, desktop, and data science notebooks. In this talk, we look at key activities to integrate into Apache Druid POCs, discussing common hurdles and signposting to important information.
Bio: Peter Marshall (https://petermarshall.io) is an Apache Druid Technology Evangelist at Imply (http://imply.io/), a company founded by original developers of Apache Druid. He has 20 years architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
Spark on Mesos: A Deep Dive (Dean Wampler and Tim Chen, Typesafe and Mesosphere) - Spark Summit
Typesafe has launched Spark support for Mesosphere's Data Center Operating System (DCOS). Typesafe engineers are contributing to the Mesos support for Spark and Typesafe will provide commercial support for Spark development and production deployment on Mesos. Mesos' flexibility allows many frameworks like Spark to run on top of it. This document discusses Spark on Mesos in coarse-grained and fine-grained modes and some features coming soon like dynamic allocation and constraints.
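A minimal sketch of the coarse-grained mode the talk discusses, using the Spark 1.x-era configuration switch; the Mesos master URL is a placeholder:

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("spark-on-mesos")
    .setMaster("mesos://mesos-master:5050")  # placeholder master URL
    .set("spark.mesos.coarse", "true")       # coarse-grained executors
    .set("spark.cores.max", "8")             # cap the cores Spark claims
)
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())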
This document provides an agenda and overview for an introductory Spark development class. The class will cover the history of big data and Spark, RDD fundamentals, the Databricks UI, transformations and actions, DataFrames, Spark UIs, and resource managers. It includes surveys of students' backgrounds and use cases. Databricks is a platform for building data pipelines and advanced analytics with Spark.
The document is an agenda for an intro to Spark development class. It includes an overview of Databricks, the history and capabilities of Spark, and the agenda topics which will cover RDD fundamentals, transformations and actions, DataFrames, Spark UIs, and Spark Streaming. The class will include lectures, labs, and surveys to collect information on attendees' backgrounds and goals for the training.
The document provides an overview of the Spark framework for lightning fast cluster computing. It discusses how Spark addresses limitations of MapReduce-based systems like Hadoop by enabling interactive queries and iterative jobs through caching data in-memory across clusters. Spark allows loading datasets into memory and querying them repeatedly for interactive analysis. The document covers Spark's architecture, use of resilient distributed datasets (RDDs), and how it provides a unified programming model for batch, streaming, and interactive workloads.
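A small sketch of the caching idea described above: load data once, cache it in memory, and run repeated interactive queries without rereading from disk (the log path is a placeholder):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-caching")

logs = sc.textFile("hdfs:///logs/app.log")  # placeholder path
errors = logs.filter(lambda line: "ERROR" in line).cache()

# The first action materializes and caches the RDD; later queries reuse it.
print(errors.count())
print(errors.filter(lambda line: "timeout" in line).count())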
The online classifieds website Leboncoin.fr is one of the success stories of the French Web: a third of the total internet population in France uses the site each month. The growth has been spectacular and swift, and was made possible by a robust and performant software platform. At the heart of the platform is a large PostgreSQL infrastructure, part of it running on some of the largest PC-class hardware available. In this presentation, we will show how we have grown our infrastructure. In particular, the amazing vertical scalability of PG will be showcased with hard numbers (IOPS, transactions/second, etc.). We will also cover some of the hard lessons we have learned along the way, including near-disasters. Finally, we will look into how innovative features from the PostgreSQL ecosystem enable new approaches to our scalability challenge.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming - Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
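A minimal D-Stream sketch: the StreamingContext slices a live socket stream into one-second micro batches, and the same RDD-style logic runs on each batch (the host and port are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-sketch")
ssc = StreamingContext(sc, batchDuration=1)  # one-second micro batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts

ssc.start()
ssc.awaitTermination()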
This talk will compare case studies from production deployments of Spark Streaming, survey emerging design patterns for integration with popular complementary OSS frameworks along with some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
This document provides an overview of Apache Spark and a hands-on workshop for using Spark. It begins with a brief history of Spark and how it evolved from Hadoop to address limitations in processing iterative tasks and keeping data in memory. Key Spark concepts are explained including RDDs, transformations, actions and Spark's execution model. New APIs in Spark SQL, DataFrames and Datasets are also introduced. The workshop agenda includes an overview of Spark followed by a hands-on example to rank Colorado counties by gender ratio using census data and both RDD and DataFrame APIs.
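A hedged sketch of that closing exercise with the DataFrame API; the census file name and column names are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("county-gender-ratio").getOrCreate()

census = spark.read.csv("census.csv", header=True, inferSchema=True)

(census
 .filter(F.col("state") == "Colorado")
 .withColumn("gender_ratio", F.col("male_pop") / F.col("female_pop"))
 .orderBy(F.desc("gender_ratio"))
 .select("county", "gender_ratio")
 .show(10))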
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...Lucidworks
Spark Search is a personal project that integrates Lucene with Apache Spark for interactive search, analytics, and machine learning on big data. Experiments showed that indexing large datasets with Lucene directly was faster than using Solr or Elasticsearch on a single node with minimum parallelism, due to their additional overhead. Spark provides an in-memory distributed computing framework that can help address the challenges of indexing and searching big data with Lucene at scale more easily than traditional distributed search technologies. The presentation called for participation to help build out the Spark Search community and project.
Getting Started with Splunk Breakout Session - Splunk
Splunk is a software platform that allows users to search, monitor, and analyze machine-generated big data for security, IT and business intelligence. It collects data from sources like servers, networks, sensors and applications. Splunk can scale from analyzing data from a single computer to very large enterprises handling terabytes of data per day. It provides real-time operational intelligence through universal data ingestion, schema-on-the-fly indexing, and an intuitive search process.
In this one-day workshop, we will introduce Spark at a high level. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses, including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Extreme Apache Spark: how in 3 months we created a pipeline that can process ... - Josef A. Habdank
The presentation is a bundle of pro tips and tricks for building an insanely scalable Apache Spark and Spark Streaming based data pipeline.
It consists of four parts:
* Quick intro to Spark
* N-billion rows/day system architecture
* Data Warehouse and Messaging
* How to deploy Spark so it does not backfire
This document provides an overview of Apache Spark, including:
- Spark allows for fast iterative processing by keeping data in memory across parallel jobs for faster sharing than MapReduce.
- The core of Spark is the resilient distributed dataset (RDD) which allows parallel operations on distributed data.
- Spark comes with libraries for SQL queries, streaming, machine learning, and graph processing.
What is Distributed Computing, Why we use Apache Spark - Andy Petrella
In this talk we introduce the notion of distributed computing and then tackle the advantages of Spark.
The Spark core content is very tiny because the whole explanation has been done live using a Spark Notebook (https://github.com/andypetrella/spark-notebook/blob/geek/conf/notebooks/Geek.snb).
This talk was given jointly by @xtordoir and myself at the University of Liège, Belgium.
Introduction to Spark: Or how I learned to love 'big data' after all. - Peadar Coyle
Slides from a talk I will give in early 2016 at the Luxembourg Data Science Meetup. The aim is to give an introduction to Apache Spark from a machine learning expert's point of view, based on various other tutorials out there. It is aimed at non-specialists.
Big Data Geopositioning, or It's Bigger on the Inside - Commit Conf 2018 - Jorge Lopez-Malla
A talk about how the Big Data and geospatial processing worlds are merging to deliver the best insights.
(The presentation with effects is here: https://docs.google.com/presentation/d/1EniUHMrRR3vQaJp6q0qBdOyZxv62DcSv3-iZXpcfwOM/edit?usp=sharing)
The document presents a talk on how to make data appealing through machine learning techniques such as K-means. It explains concepts such as training algorithms, running them at scale, and representing data. It also describes the technologies Docker, Apache Spark, Jupyter Notebook, and Apache Toree, which can be used to analyze and visualize data interactively.
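A minimal sketch of the K-means workflow the talk describes, using Spark MLlib's DataFrame API; the input file, feature columns, and the value of k are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

points = spark.read.csv("points.csv", header=True, inferSchema=True)

# Assemble the raw numeric columns into the single vector MLlib expects.
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features = assembler.transform(points)

# Train K-means with an assumed k; training runs across the cluster.
model = KMeans(k=3, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show(5)
print(model.clusterCenters())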
This document shows how to optimize and monitor Spark jobs with the Spark Web UI. It explains basic Spark terminology such as applications, jobs, stages, and tasks. It then describes how the Spark Web UI provides information about applications, jobs, stages, caching, and counters that helps optimize the DAG. It also covers how the Spark Web UI monitors Spark SQL and Spark Streaming jobs.
A talk about adding a proxy user at Spark task execution time, given at Spark Summit East 2017 by Jorge López-Malla and Abel Ricon.
full video:
https://www.youtube.com/watch?v=VaU1xC0Rixo&feature=youtu.be
A meetup talk on Spark and its interaction with Kerberos; to see it with animations: https://docs.google.com/presentation/d/1DCjp_-s9J647Vydt5ltmqfXpS2PrJDo3KzoVz0C9T7Q/edit?usp=sharing
The Big Data problem has stopped being a passing fad and has settled in as a new reality in our day-to-day lives, and technology has adapted to this new reality, allowing us to tackle complex problems in a simple and almost transparent way.
But what about us? Have we changed the way we look at projects and attack the solution? Are we still trying to solve this new set of problems with the same methodology? Do we still believe that Big Data will solve all our problems by magic?
This talk covers how, based on the speaker's experience in projects across different business areas, the way of approaching them has changed, and how the various problems encountered when tackling a Big Data project have been solved.
An Apache Spark Madrid meetup talk about the mistakes we all make in Big Data projects.
Since the animations do not come across very well, you can view the deck at the following link:
https://docs.google.com/presentation/d/1W4Foy9u0NkZziQ36I5_00b_e-JlwhSshSFv-hcxaBpM/edit?usp=sharing
Apache Big Data Europe - How to make money with your own data - Jorge Lopez-Malla
This document discusses how Stratio used big data technologies like Apache Spark to help a Middle Eastern telecommunications company with data challenges. It describes Stratio as the first Spark-based big data platform and discusses how they helped the telco process over 9.5 million daily events from 9.2 million customers. Specifically, Stratio used Spark and its machine learning library MLlib to build models from millions of data points to recognize patterns and improve network coverage, gather customer insights, and monetize data.
Spark Meetup: the Combination of Its Different Modules - Jorge Lopez-Malla
The document presents an introduction to Spark and its different modules. It briefly explains Spark Core, which includes RDDs, transformations, and actions; Spark SQL, which allows SQL queries over RDDs; and how these modules relate to each other. It also mentions Spark Streaming and MLlib, but focuses mainly on describing Spark Core and Spark SQL and how they can be combined by creating DataFrames from RDDs or by performing join operations.
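A small sketch of combining Spark Core and Spark SQL as described: build an RDD, turn it into a DataFrame, and join it with another one (the sample data is invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("core-plus-sql").getOrCreate()
sc = spark.sparkContext

# Spark Core: a plain RDD of tuples.
users_rdd = sc.parallelize([(1, "ana"), (2, "luis"), (3, "eva")])

# Spark SQL: DataFrames from an RDD and from local data, combined by a join.
users = spark.createDataFrame(users_rdd, ["id", "name"])
orders = spark.createDataFrame([(1, 30.0), (1, 12.5), (3, 7.0)],
                               ["user_id", "amount"])

(users.join(orders, users.id == orders.user_id)
      .groupBy("name")
      .sum("amount")
      .show())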
Do you want Software for your Business? Visit Deuglo
Deuglo has top software developers in India. They are experts in software development and help design and create custom software solutions.
Deuglo follows a seven-step method for delivering its services to customers, called the software development life cycle (SDLC) process.
Requirement — collecting the requirements is the first phase in the SDLC process.
Feasibility Study — after the requirements are collected, the project's feasibility is assessed before design begins.
Design — in this phase, they start designing the software.
Coding — when the design is completed, the developers start coding the software.
Testing — when the coding of the software is done, the testing team starts testing.
Installation — after testing is complete, the application is deployed to the live server and launched!
Maintenance — once customers start using the software, the team maintains and updates it.
Graspan: A Big Data System for Big Code Analysis - Aftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
SMS API Integration in Saudi Arabia | Best SMS API Service - Yara Milbes
Discover the benefits and implementation of SMS API integration in the UAE and Middle East. This comprehensive guide covers the importance of SMS messaging APIs, the advantages of bulk SMS APIs, and real-world case studies. Learn how CEQUENS, a leader in communication solutions, can help your business enhance customer engagement and streamline operations with innovative CPaaS, reliable SMS APIs, and omnichannel solutions, including WhatsApp Business. Perfect for businesses seeking to optimize their communication strategies in the digital age.
Unveiling the Advantages of Agile Software Development.pdf - brainerhub1
Learn about the advantages of Agile software development and simplify your workflow to spur quicker innovation. Jump right in!
E-commerce Development Services - Hornet Dynamics
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies - Quickdice ERP
Explore the seamless transition to e-invoicing with this comprehensive guide tailored for Saudi Arabian businesses. Navigate the process effortlessly with step-by-step instructions designed to streamline implementation and enhance efficiency.
Transform Your Communication with Cloud-Based IVR Solutions - TheSMSPoint
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
Need for Speed: Removing speed bumps from your Symfony projects ⚡️ - Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
Artificial Intelligence and XPath Extension Functions - Octavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
GraphSummit Paris - The art of the possible with Graph Technology - Neo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Using Query Store in Azure PostgreSQL to Understand Query Performance - Grant Fritchey
Microsoft has added an excellent new extension in PostgreSQL on their Azure Platform. This session, presented at Posette 2024, covers what Query Store is and the types of information you can get out of it.
OpenMetadata Community Meeting - 5th June 2024 - OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed the data quality capabilities that are integrated with the Incident Manager, providing a complete solution for your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code - Aftab Hussain
Understanding variable roles in code has been found to be helpful to students in learning programming -- could variable roles also help deep neural models in performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
UI5con 2024 - Keynote: Latest News about UI5 and its Ecosystem - Peter Muessig
Learn about the latest innovations in and around OpenUI5/SAPUI5: UI5 Tooling, UI5 linter, UI5 Web Components, Web Components Integration, UI5 2.x, UI5 GenAI.
Recording:
https://www.youtube.com/live/MSdGLG2zLy8?si=INxBHTqkwHhxV5Ta&t=0
4. JORGE LÓPEZ-MALLA
@jorgelopezmalla
After working with traditional processing methods, I started to do some R&D Big Data projects and I fell in love with the Big Data world. Currently I'm doing some awesome Big Data projects and tools at Stratio.
SKILLS
Presentation
5. MARCOS PEÑATE
@marcosmi5
Coming from the Dark Side, I found a place to grow and innovate at Stratio. I am passionate about astrophysics, a compulsive sci-fi consumer, and I really enjoy automating and Dockerizing everything around me!
SKILLS
Presentation