The world has changed, and one huge server no longer does the job. When you are dealing with vast amounts of data that keep growing, the ability to scale out is your saviour.
This lecture covers the basics of Apache Spark and distributed computing, along with the development tools needed for a functional environment.
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward, Ofek Alumni
He has over 9 years of experience building systems ranging from near-real-time applications to Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-i...
Day 3 - AWS MySQL Relational Database Service Best Practices for Performance ... (Amazon Web Services)
Amazon RDS makes it easy to set up, operate, and scale relational databases in the cloud. Amazon RDS for MySQL supports applications that require up to tens of thousands of IOPS and lets you scale on demand without administrative complexity. In this webinar, we will discuss best practices for getting the most out of Amazon RDS for MySQL, as well as techniques for migrating data to and from the service.
Reasons to attend:
- Learn the details of the master-slave dual-AZ configuration.
- Learn about cross-region replication.
- Learn about Provisioned IOPS and tips on getting the most from your Amazon RDS MySQL Service.
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks (Amazon Web Services)
Learning Objectives:
- Learn about the capabilities of the PostgreSQL database
- Learn about PostgreSQL offerings on AWS
- Learn how to migrate from Oracle to PostgreSQL with minimal disruption
Day 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance Database (Amazon Web Services)
Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business. In this webinar we review the different types of Amazon RDS available and how to move your existing databases to Amazon RDS with minimum disruption.
Reasons to attend:
- Learn how Amazon RDS can reduce the overhead of running high-performance, mission-critical databases.
- Learn how to migrate your existing database workloads into Amazon RDS running on the AWS Cloud.
- Learn how to scale up and scale down your Amazon RDS instance and save money with reserved instances.
Computational Patterns of the Cloud - QCon NYC 2014 (Ines Sombra)
The Cloud has undoubtedly changed the way we think about computing, IT operations, innovation, and entrepreneurship. But what are the computational patterns that have emerged from the pervasiveness of public clouds? What can we leverage to improve our organizations? And what are the challenges that we face going forward?
In this talk, I will introduce you to cloud computing’s paradigms and discuss their applications with practical examples from Engine Yard’s customers, peers, and partners. We will also cover antipatterns and myths. If you are curious about cloud computing or want to improve your cloud strategy, this talk is for you.
NOTE: Open an issue at the accompanying GitHub repo if you want me to explain something in more detail: https://github.com/Randommood/QConNYC2014/
Implement a disaster recovery solution for your on-prem SQL with Azure? Easy! (Marco Obinu)
Slides presented at SQL Saturday 980 Plovdiv, talking about the different architectures you can implement to protect your on-premises SQL Server workloads on Azure for DR purposes.
Presentation on how developer roles change when meeting cloud infrastructure, and how a "role driven"/template-based VM deployment model helps this separation.
Migrate from SQL Server or Oracle into Amazon Aurora using AWS Database Migra... (Amazon Web Services)
As organizations look to improve application performance and decrease costs, they are increasingly looking to migrate from commercial database engines into open source. Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. In this webinar, we will cover how to use Database Migration Service (DMS) to go about the migration, and how to use the schema conversion tool to convert schemas into Amazon Aurora. We’ll then follow with a quick demo of the entire process, and close with tips and best practices.
Learning Objectives:
Understand how AWS Database migration can help you migrate from a commercial database into Amazon Aurora to improve application performance and decrease database costs.
In our first Windows webinar, find out about the benefits of migrating your Windows workloads to AWS. During the session, we will explain why AWS makes your Windows applications faster, more reliable, and more secure. We will also talk about how to bring your own license (BYOL) and how to architect, deploy, and manage your Windows platforms on AWS.
Deep Dive on MySQL Databases on AWS - AWS Online Tech Talks (Amazon Web Services)
Learning Objectives:
- Learn about MySQL deployment options on AWS
- Learn how to maintain high availability and security of your data
- Learn how to migrate MySQL databases to Amazon RDS
AWS offers customers a range of different database options. These include Amazon DynamoDB, a fully-managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data as well as Amazon Relational Database Service (RDS), a service that makes it easy to set up, operate, and scale a relational database in the cloud with support for MySQL, Microsoft SQL Server, PostgreSQL, and Oracle Database. In this session you’ll get an overview of AWS database options and how they might help support your application and see how to get started.
This session gives an insider view of some of the innovations that help make the AWS Cloud unique. The speaker shows examples of AWS networking innovations, from the interregional network backbone, through custom routers and the networking protocol stack, all the way down to individual servers, followed by examples from AWS server hardware, storage, and power distribution and then, up the stack, in high-scale streaming data processing.
Amazon Web Services (AWS) offers a wide range of database options to fit your application requirements, from fully managed database services that can be launched in minutes with just a few clicks to self-managed databases running on EC2. AWS managed database services include Amazon Relational Database Service (Amazon RDS), with support for six commonly used database engines; Amazon Aurora, a MySQL- and PostgreSQL-compatible relational database; Amazon DynamoDB, a NoSQL database service; and Amazon Redshift, a petabyte-scale data warehouse service. AWS also provides the AWS Database Migration Service, which makes it easy and inexpensive to migrate your databases to the AWS cloud.
In this webinar, we take a closer look at the AWS database offerings and learn how to quickly select, set up, operate, and scale your database in the cloud.
Learning Objectives:
• Gain insights into the AWS database offering and know which to select for your workload.
• Learn how the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS) can facilitate and simplify migrating your business critical applications to Amazon Web Services.
• Learn how Amazon DynamoDB Accelerator (DAX) can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second.
• Hear from our partners like Version1 and Clckwrk who can help you in your journey towards Database freedom.
Deep Dive on Elastic File System - February 2017 AWS Online Tech Talks (Amazon Web Services)
Organizations face significant challenges moving their applications to the cloud when they require a standard file system interface for accessing their cloud data. In this technical session, we will explore the world’s first cloud-scale file system and its targeted use cases. Attendees will learn about the Amazon Elastic File System (EFS) features and benefits, how to identify applications that are appropriate for use with Amazon EFS, and details about its performance and security models. We will highlight and demonstrate how to deploy Amazon EFS in one of our most common use cases and will share tips for success throughout.
Learning Objectives:
• Recognize why and when to use Amazon EFS
• Understand key technical/security concepts
• Learn how to leverage EFS’s performance
• See a demo of EFS in action
• Review EFS’s economics
Training for AWS Solutions Architect at http://zekelabs.com/courses/amazon-web-services-training-bangalore/. This slide deck describes the AWS database offerings: Relational Database Service (RDS), Read Replicas, Multi-AZ, DynamoDB, ElastiCache, Redshift, Aurora, and Neptune.
___________________________________________________
zekeLabs is a technology training platform. We provide instructor-led corporate and classroom training for professionals on industry-relevant, cutting-edge technologies such as Big Data, Machine Learning, Natural Language Processing, Artificial Intelligence, Data Science, Amazon Web Services, DevOps, and Cloud Computing, and on frameworks such as Django, Spring, Ruby on Rails, Angular 2, and many more.
Reach out to us at www.zekelabs.com or call us at +91 8095465880 or drop a mail at info@zekelabs.com
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
Managing Data with Volume, Velocity, and Variety with Amazon ElastiCache for Redis (Amazon Web Services)
Learn how to use Amazon ElastiCache with AWS IoT and AWS Lambda to create serverless solutions that let you rapidly make use of large and multisource data sets.
Amazon Redshift is a fast, managed, petabyte-scale data warehouse that makes it simpler and more cost-effective to analyze all of your data using the business intelligence tools you already have. Start small for just USD 0.25 per hour, with no commitments, and scale up to petabytes for USD 1,000 per terabyte per year, less than a tenth of the cost of traditional solutions. Customers typically report 3x compression, which reduces their cost to USD 333 per uncompressed terabyte per year.
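The last figure in the Redshift description above follows from simple division. The short Python sketch below only reproduces that arithmetic from the two numbers quoted there (USD 1,000 per stored terabyte per year and 3x compression); the function and constant names are illustrative and not part of any AWS tool.

```python
# Figures quoted in the description above.
PRICE_PER_STORED_TB_YEAR_USD = 1000  # yearly cost per terabyte actually stored
COMPRESSION_RATIO = 3                # typical compression reported by customers


def cost_per_uncompressed_tb(price_per_stored_tb: float, compression_ratio: float) -> float:
    """Yearly cost of one terabyte of raw (uncompressed) data."""
    # With 3x compression, 1 TB of raw data occupies only 1/3 TB of storage.
    return price_per_stored_tb / compression_ratio


effective_cost = cost_per_uncompressed_tb(PRICE_PER_STORED_TB_YEAR_USD, COMPRESSION_RATIO)
print(f"{effective_cost:.0f} USD per uncompressed TB per year")
```

The result, about 333 USD, matches the number in the description. A different compression ratio for your own data would change it proportionally.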
AWS re:Invent 2016: Workshop: Converting Your Oracle or Microsoft SQL Server ... (Amazon Web Services)
In this workshop, you migrate a sample sporting event and ticketing database from Oracle or Microsoft SQL Server to Amazon Aurora or PostgreSQL using the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS). The workshop includes the migration of tables, indexes, procedures, functions, constraints, views, and more. We run SCT on an Amazon EC2 Windows instance, so bring a laptop with Remote Desktop (or some other method of connecting to the Windows instance). Ideally, you should be familiar with relational databases, especially Oracle or SQL Server and PostgreSQL or Aurora, to get the most from this session. Additionally, attendees should be familiar with SCT and DMS. Familiarity with SQL Developer and pgAdmin III will be helpful but is not required.
Prerequisites:
- Participants should have an AWS account established and available for use during the workshop.
- Please bring your own laptop.
Introduction to Storage on AWS - AWS Summit Cape Town 2017 (Amazon Web Services)
With AWS, you can choose the right storage service for the right use case. This session shows the range of AWS choices that are available to you: Amazon S3, Amazon EBS, Amazon EFS, Amazon Glacier and Cloud Data Migration solutions.
AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ... (Amazon Web Services)
Through a combination of Amazon ECS and open source technologies, customers are able to build portable CI/CD pipelines on AWS. As container-based deployments become more complex, they require additional rigging for integration. In this session, we show how popular Apache products like Kafka, Storm, and ZooKeeper are being deployed on top of Amazon ECS. We hear from HERE, a provider of mapping data, technologies, and services to the automotive, consumer, and enterprise sectors, about an approach that leverages Consul from HashiCorp and Amazon ECS clusters for short-cycle deployments and tag-based environment promotion.
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313) (Amazon Web Services)
Large-scale enterprise migration can be a complex undertaking, especially for organizations that re-architect solutions to leverage the benefits of the Cloud. FINRA, which regulates US equities and options markets, recently completed a 2.5-year migration and re-architecture of its Big Data platform. Their platform consumes billions of market events every day. FINRA has developed scalable platforms and services on AWS that enable migrating enterprise applications and business functions to the Cloud quickly. Their data management platform takes advantage of AWS storage and compute products. In this session, IT influencers and decision makers will learn lessons from FINRA’s migration, including how to create an enterprise-class Cloud architecture and which technology skills are required for transitioning to the Cloud. We also share examples of the business value FINRA has realized.
A use case of Apache Spark at Windward Ltd.
Video lecture on YouTube: https://www.youtube.com/watch?v=rPO6P5YIKUI
The talk presents the company's domain, a short introduction to Apache Spark, and the toolbox used at Windward Ltd. to build a working production Spark data pipeline.
An introduction to Apache Spark, with an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API), and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
We will present a use case: implementing a data pipeline in the maritime domain at Windward using Spark applications.
The process involved converting a monolithic application into a fully distributed, scalable one.
We'll talk about the tools and the process of taking an idea and developing Spark applications around it, and show the development of an application end to end, from DevOps to the way of thinking about application development, with use cases and the "lessons learned" at Windward Ltd. I hope the talk gives you some more practical tools for "Spark"ing your way around.
In this one-day workshop, we will introduce Spark at a high level. Writing Spark applications is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
An engine for processing big data in a faster (than MapReduce), easier, and extremely scalable way. An open-source, parallel, in-memory cluster computing framework. A solution for loading, processing, and analyzing large-scale data end to end. Iterative and interactive: Scala, Java, Python, and R, with a command-line interface.
This is an introductory tutorial on Apache Spark from the Lagos Scala Meetup II. We discussed the basics of the Spark processing engine and how it relates to Hadoop MapReduce, with a short hands-on exercise at the end of the session.
Data Science & Best Practices for Apache Spark on Amazon EMR (Amazon Web Services)
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
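To make the DAG idea in the description above more concrete, here is a toy, single-machine Python sketch of lazy evaluation: transformations only record a plan, and nothing is computed until an action is called. This is not Spark's API or implementation. The `LazyDataset` class and its methods are hypothetical names chosen for illustration, and the "plan" here is a simple linear chain rather than a real DAG scheduled across partitions and stages.

```python
import functools
from typing import Any, Callable, Iterable, Iterator, List, Tuple


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical): records a plan, computes only on an action."""

    def __init__(self, source: Iterable[Any], plan: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source
        self._plan = plan  # ordered transformations, i.e. a linear lineage

    # Transformations: return a new dataset with a longer plan, compute nothing.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def lineage(self) -> List[str]:
        """Names of the recorded steps, in execution order."""
        return [kind for kind, _ in self._plan]

    def _run(self) -> Iterator[Any]:
        # Chain the recorded steps over the source; elements flow through lazily.
        data: Iterator[Any] = iter(self._source)
        for kind, fn in self._plan:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: these trigger the actual computation.
    def collect(self) -> List[Any]:
        return list(self._run())

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        return functools.reduce(fn, self._run())


calls: List[int] = []  # records when the mapped function really runs


def square(x: int) -> int:
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)        # still empty: only the plan exists so far
odd_squares = pipeline.collect()         # the action runs the whole plan
print(pipeline.lineage(), odd_squares)   # prints "['map', 'filter'] [1, 9, 25]"
```

Because the plan is kept rather than the intermediate results, a lost result can be recomputed from the source by replaying the same steps, which is the intuition behind the fault tolerance mentioned in the description.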
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... (Simplilearn)
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components of Spark and how Spark works, with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case at Conviva. Now, let's get started with what Apache Spark is.
The following topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
2. About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
Co-Founder “Big Things” Big Data Community
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” – IAF
• Interested in almost every kind of technology – A True Geek
4. Agenda
What is Spark?
Spark Infrastructure and Basics
Spark Features and Suite
◦ Spark-Shell Live Demo
◦ Cassandra & Spark
Development with Spark
Conclusion
5. What is Spark?
Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop
Efficient
◦ General execution graphs
◦ In-memory storage
Usable
◦ Rich APIs in Java, Scala, Python
◦ Interactive shell
6. What is Spark?
Apache Spark is a general-purpose, cluster
computing framework
Spark does computation In Memory & on
Disk
Apache Spark has low level and high level
APIs
7. Spark Philosophy
Make life easy and productive for data
scientists
Well documented, expressive APIs
Powerful domain specific libraries
Easy integration with storage systems
… and caching to avoid data movement
Predictable releases, stable APIs
Stable release every 3 months
9. Spark Contributors
Highly active open source community
(09/2015)
◦ https://github.com/apache/spark/
https://www.openhub.net/p/apache-spark
10. About Spark project
Spark was founded at UC Berkeley and the
main contributor is “Databricks”.
Interactive Spark shell in Scala and Python
◦ (spark-shell, pyspark)
Currently stable in version 1.6
16. Driver and Spark Context
Spark Context is your “handle” to the Spark
cluster.
The driver program contains the main
method.
You use your Spark Context to access your
cluster.
◦ Configure the connection to the cluster
◦ It lets you create RDDs.
The variable named sc (for the Spark
Context) is already defined in your Driver in
the Spark Shell.
17. What’s an RDD?
Resilient Distributed Datasets
◦ Fault tolerant
◦ Parallel data structure
◦ Distributed on the nodes in the cluster
◦ Immutable!!!
◦ Can persist intermediate results in memory
◦ Transformations are operators and are lazily
evaluated
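The properties listed above can be sketched in a few lines of plain Python. This is only a toy model and not Spark's API: `ToyRDD`, `parallelize` and `collect` are hypothetical names that loosely mirror the RDD vocabulary, and the "partitions" live in one process rather than on cluster nodes. It shows two of the bullets: data split into partitions, and transformations returning a new dataset instead of changing the old one.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Toy stand-in for an RDD: an immutable tuple of immutable partitions."""
    partitions: Tuple[Tuple[int, ...], ...]

    @staticmethod
    def parallelize(data, num_partitions):
        # Deal the elements round-robin into num_partitions chunks.
        chunks = tuple(tuple(data[i::num_partitions]) for i in range(num_partitions))
        return ToyRDD(chunks)

    def map(self, fn):
        # Never mutates self: builds and returns a brand-new ToyRDD.
        return ToyRDD(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def collect(self):
        # Gather every partition back into a single list.
        return [x for part in self.partitions for x in part]


numbers = ToyRDD.parallelize(list(range(6)), num_partitions=3)
doubled = numbers.map(lambda x: x * 2)

print(numbers.partitions)   # the original dataset is untouched
print(doubled.collect())
```

In a real cluster each partition would be processed by a different executor; here the inner loop simply visits them one after another.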
20. RDD Persistence and
partitioning
Users control which RDDs will be
reused (in memory and on disk)
◦ Persist, Cache, Unpersist
Users can ask for an RDD to be
partitioned across machines
Only the lost partitions of an RDD
need to be recomputed upon failure.
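The last point, recomputing only what was lost, can be illustrated with another toy model. Again this is a sketch under assumptions, not Spark code: `ToyCachedRDD`, `lose_partition` and the `computations` log are hypothetical names invented for the example. Each partition remembers how it is derived from its source (its lineage), so a dropped partition can be rebuilt without touching the others.

```python
class ToyCachedRDD:
    """Toy model of a persisted dataset that can rebuild lost partitions."""

    def __init__(self, source_partitions, fn):
        self._source = source_partitions  # lineage: where the data comes from
        self._fn = fn                     # lineage: how each element is derived
        self._cache = {}                  # partition index -> computed values
        self.computations = []            # log of partition indexes computed

    def _compute(self, index):
        self.computations.append(index)
        return [self._fn(x) for x in self._source[index]]

    def persist(self):
        # Real Spark's persist is lazy; this toy computes right away for brevity.
        for i in range(len(self._source)):
            self._cache[i] = self._compute(i)
        return self

    def lose_partition(self, index):
        # Simulate a failed machine dropping one cached partition.
        self._cache.pop(index, None)

    def collect(self):
        result = []
        for i in range(len(self._source)):
            if i not in self._cache:
                # Only the missing partition is recomputed from its lineage.
                self._cache[i] = self._compute(i)
            result.extend(self._cache[i])
        return result


squares = ToyCachedRDD([[1, 2], [3, 4], [5, 6]], lambda x: x * x).persist()
squares.lose_partition(1)
result = squares.collect()

print(result)
print(squares.computations)  # partition 1 appears twice, the others once
```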
21. Spark execution engine
Spark uses lazy evaluation
◦ Runs the code only when it encounters an
action operation
There is no need to design and write a
single complex map-reduce job.
◦ In Spark we can write smaller and
manageable operations
◦ Spark will group operations together
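A minimal sketch of the idea in plain Python, assuming nothing about Spark's internals: `LazyPipeline` is a hypothetical class written for this example. Transformations (`map`, `filter`) only record what should happen; the action (`collect`) runs all recorded steps together in a single pass over the data.

```python
class LazyPipeline:
    """Toy lazy dataset: transformations are recorded, an action runs them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # recorded transformations, not yet executed

    def map(self, fn):
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: apply every recorded step to each element in one pass.
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_double(x):
    calls.append(x)  # record that the function actually ran
    return x * 2


pipeline = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # nothing has run yet
result = pipeline.collect()

print(calls_before_action)
print(result)
```

Because the small steps are only descriptions until the action runs, the engine is free to group them, which is why there is no need to hand-write one large job.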
22. Spark execution engine
Serializes your code to the executors
◦ Can choose your serialization method
(Java serialization, Kryo)
In Java - functions are specified as
objects that implement one of Spark’s
Function interfaces.
◦ Can use the same method of
implementation in Scala and Python as
well.
23. Persistence layers for Spark
Distributed system
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ HBase
File formats
◦ Text file
CSV, TSV, Plain Text
◦ Sequence File
◦ Avro
◦ Parquet
26. Spark Core Features
Distributed In memory Computation
Stand alone and Local Capabilities
History server for Spark UI
Resource management Integration
Unified job submission tool
27. History Server
Can be run on all Spark deployments,
◦ Stand Alone, YARN, Mesos
Integrates both with YARN and Mesos
On YARN / Mesos, run the history server as
a daemon.
30. Cassandra & Spark
Cassandra cluster
◦ Bare metal vs. On the cloud
DSE – DataStax Enterprise
◦ Cassandra & Spark in each node
Vs
◦ Separate Cassandra and Spark clusters
32. Where do I start from?!
Download Spark as a package
◦ Run it in “local” mode (no need for a real
cluster)
◦ “spark-ec2” scripts to ramp-up a Stand Alone
mode cluster
◦ Amazon Elastic Map Reduce (EMR)
Yarn vs. Mesos vs. Stand Alone
33. Running Environments
Development – Testing – Production
◦ Don’t you need more?
◦ Be as flexible as you can
Cluster Utilization
◦ Unified Cluster for all environments
Vs.
◦ Cluster per Environment
(Cluster per Data Center)
Configuration
◦ Local Files vs. Distributed
34. Saving and Maintaining the Data
Local File System – Not effective in a distributed
environment
HDFS
◦ Might be very Expensive
◦ Locality Rules – Spark + HDFS node + Same machine
S3
◦ High latency and pretty slow but low costs
Cassandra
◦ Rigid data model
◦ Very fast and depends on the Volume of the data can be
35. DevOps – Keep It Simple, Stupid
Linux
◦ Bash scripts
◦ Crontab
Automation via Jenkins
Continuous Deployment – with every GIT push
Dev Testing
Live
Staging
Production
Daily Manual Automatic Automatic
36. Build Automation
Maven
◦ Sonatype Nexus artifact management
◦ Deploy and Script generation scripts
◦ Per Environment Testing
◦ Data Validation
◦ Scheduled Tasks
40. Conclusion
Spark is a popular and very powerful
distributed in-memory computation
framework
Broadly used and has lots of contributors
A leading tool in the new world of petabytes
of unexplored data