OASIS is a web-based data analysis platform that enables employees at LINE to analyze data from their services in a multi-tenant Hadoop cluster. It provides query execution, query result visualization, code execution in Scala/Python/R, scheduled execution, and result sharing. OASIS addresses security, stability, and feature requirements with technologies such as Apache Spark on YARN, Kerberos authentication, and Apache Ranger authorization. It also mitigates the HDFS "small files" problem through an insertOverwrite API and scales out across multiple servers. OASIS currently serves over 1,600 notebooks, 40+ services, and 500+ users at LINE.
Keiji Yoshida – Data Engineer, LINE Corporation
DataEngConf Barcelona: https://www.dataengconf.com/speakers-bcn18
Spark + AI Summit Europe: https://databricks.com/sparkaisummit/europe
5. DATA PLATFORM
• Services: LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE
• Their data flows into a Hadoop cluster (data lake)
• The data lake is used for ETL, analysis, and BI / reporting
6. DATA DEMOCRATIZATION
• Make the Hadoop cluster public within LINE
• Enable employees to analyze their service's data as they like
• Speed up the data analysis process and decision making
(Diagram: services such as LINE Ads Platform and LINE Creators Market sharing one multi-tenant Hadoop cluster)
8. 1. SECURITY
• Strict access control
• Allow employees to access only their service's data
(Diagram: per-service access boundaries within the multi-tenant Hadoop cluster)
13. 3. FEATURES
• Query execution
• Query result visualization
• Code execution (Scala, Python, R)
• Scheduled execution
• Result sharing
• Result access control
14. 3. FEATURES
Existing tools were evaluated against these requirements:
• Apache Zeppelin: has security and stability issues
• Jupyter: does not support query result visualization or scheduled execution
• Redash: does not support Spark application code execution or user impersonation
• Apache Superset: does not support Spark application code execution or scheduled execution
• Apache Hue: relies on Apache Livy; does not support concurrent Spark SQL execution or Spark application sharing
16. APACHE ZEPPELIN 0.7.3 : SECURITY
• A user can launch a Spark application under another user's account
• This cheats Apache Ranger's authorization on HDFS
(Diagram: User A launching a Spark application as User B against HDFS / Apache Ranger)
17. APACHE ZEPPELIN 0.7.3 : STABILITY
• Runs only on a single server
• Does not support "yarn-cluster" mode, so every driver program runs on the Zeppelin server itself
• Freezes easily
(Diagram: one Apache Zeppelin server hosting driver programs 1 through 5)
23. SPARK APPLICATION
• Launched per notebook session
• Uses the notebook author's account for accessing HDFS
• Supports Spark, Spark SQL, PySpark, and SparkR
25. NOTEBOOK SHARING
• Notebooks can be shared within a "space"
• "space": the root directory of notebooks for each LINE service
• Access rights: "read write", "read only"
(Diagram: each space holds its notebooks together with its read-write and read-only users)
26. PARAMETERS
• Parameters can be injected into a notebook
• Read-only users can re-render a notebook with different parameter values
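The slides do not show OASIS's parameter syntax; as a minimal sketch, parameter injection can be modeled as placeholder substitution into a query template. The `${...}` syntax and the function name here are assumptions, not OASIS's actual API:

```python
import re

def inject_params(template: str, params: dict) -> str:
    """Replace ${name} placeholders in a notebook paragraph with
    user-supplied values (a sketch; the syntax is an assumption)."""
    def repl(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return str(params[name])
    return re.sub(r"\$\{(\w+)\}", repl, template)
```

A read-only user re-rendering a notebook would amount to re-running the substituted query, e.g. `inject_params("SELECT * FROM logs WHERE dt = '${dt}'", {"dt": "2018-10-01"})`.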
28. SMALL FILES PROBLEM
• Small files consume a lot of the NameNode's memory
• They degrade query performance
• The default value of spark.sql.shuffle.partitions is 200, so a shuffled insert writes up to 200 output files regardless of result size
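To see why the default matters: an INSERT that goes through a shuffle writes one file per shuffle partition. A rough illustration (the 100 MB figure is made up, not from the talk):

```python
def avg_file_size_mb(result_mb: float, shuffle_partitions: int = 200) -> float:
    """Average output file size when a shuffled INSERT writes one file
    per shuffle partition (200 is Spark SQL's default)."""
    return result_mb / shuffle_partitions

# A 100 MB result fragments into 200 files of ~0.5 MB each, far below a
# typical 128 MB HDFS block; repeated scheduled inserts multiply the count.
```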
29. DATA INSERTION API
• oasis.insertOverwrite(query, table)
• Replaces spark.sql(query).write.mode("overwrite").insertInto(table)
• The number of output files is optimized automatically
31. OASIS.INSERTOVERWRITE(QUERY, TABLE)
3. Calculate the optimal number of files: filesNum = total file size / block size
4. Recreate the temporary table's data with that number of files: spark.sql(query).repartition(filesNum).write.mode("overwrite").insertInto(tmpTable)
32. OASIS.INSERTOVERWRITE(QUERY, TABLE)
5. Drop Hive partitions from the target table: spark.sql("alter table … drop partition …")
6. Move the temporary table's files to the target table: FileSystem.get(…).rename(tmpPath, targetPath)
33. OASIS.INSERTOVERWRITE(QUERY, TABLE)
7. Add Hive partitions to the target table: spark.sql("alter table … add partition …")
8. Drop the temporary table: spark.sql("drop table …")
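Putting slides 31 to 33 together, the whole flow can be sketched in PySpark roughly as below. This is a reconstruction from the slide snippets, not OASIS's actual code: the helper names, parameters, the 128 MB block size, and the assumption that a Hadoop FileSystem handle (`fs`) is available are all mine; steps 1 and 2 (creating and populating the temporary table) are not shown in the slides and are elided here.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size

def optimal_file_num(total_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    # Step 3: one file per HDFS block's worth of data, at least one file.
    return max(1, math.ceil(total_bytes / block_size))

def insert_overwrite(spark, fs, query, table, tmp_table,
                     tmp_path, target_path, partition_spec):
    """Sketch of oasis.insertOverwrite(query, table), steps 3 to 8."""
    # Step 3: derive the file count from the temporary table's total size
    # (getContentSummary/getLength are Hadoop FileSystem APIs).
    total_bytes = fs.getContentSummary(tmp_path).getLength()
    files_num = optimal_file_num(total_bytes)

    # Step 4: rewrite the temporary table with the optimal file count.
    spark.sql(query).repartition(files_num) \
        .write.mode("overwrite").insertInto(tmp_table)

    # Step 5: drop the partitions being replaced from the target table.
    spark.sql(f"alter table {table} drop partition ({partition_spec})")

    # Step 6: move the files with a cheap HDFS rename, not a copy.
    fs.rename(tmp_path, target_path)

    # Step 7: register the moved files as partitions of the target table.
    spark.sql(f"alter table {table} add partition ({partition_spec})")

    # Step 8: clean up the temporary table.
    spark.sql(f"drop table {tmp_table}")
```

The rename in step 6 is what makes the swap cheap: the rewritten files change owners atomically at the metadata level instead of being copied block by block.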
35. SPARK INTERPRETER ROUTING
• Route information is managed in Redis
• Code from the same notebook session goes to the same interpreter
• A load balancer distributes end users across frontend / API servers round-robin
(Diagram: end users reach frontend / API servers 1 and 2 through the load balancer; the frontends update and search route information in Redis and forward Spark application code to Spark interpreters 1 and 2)
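A minimal sketch of the sticky-routing logic, with a plain dict standing in for Redis (the key scheme and the round-robin assignment policy for new sessions are assumptions):

```python
import itertools

class InterpreterRouter:
    """Sticky routing: code from the same notebook session always goes to
    the same Spark interpreter. In OASIS the mapping lives in Redis
    (e.g. GET/SET on a per-session key); a dict stands in for it here."""

    def __init__(self, interpreters):
        self._routes = {}  # session id -> interpreter
        self._next = itertools.cycle(interpreters)

    def route(self, session_id: str) -> str:
        # Reuse the stored route if one exists; otherwise assign the
        # next interpreter round-robin and remember the choice.
        if session_id not in self._routes:
            self._routes[session_id] = next(self._next)
        return self._routes[session_id]
```

Storing the mapping in shared Redis rather than in each frontend's memory is what lets any frontend / API server behind the load balancer forward a session's code to the right interpreter.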
36. MULTIPLE JOB SCHEDULERS
• Make the OASIS job scheduler highly available
• Utilize Quartz's clustering feature: multiple scheduler instances coordinate through a shared MySQL database
(Diagram: job schedulers 1 through 3 clustered via Quartz over MySQL)
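Quartz clustering works through its JDBC job store; a typical quartz.properties for a setup like the one above might look as follows (the values and data source name are illustrative, not OASIS's actual configuration):

```properties
# JDBC job store shared by all scheduler instances (MySQL)
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource = oasisDS

# Clustering: instances coordinate through the database and take over
# each other's jobs on failure
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000

# Instances share a scheduler name but each gets a unique id
org.quartz.scheduler.instanceName = OasisJobScheduler
org.quartz.scheduler.instanceId = AUTO
```

With this in place, any of the three scheduler instances can fire a due job, and a crashed instance's jobs are recovered by the survivors.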
46. RECAP : OASIS
• Data analysis platform for a multi-tenant Hadoop cluster
• Data can be extracted, processed, visualized, and shared
• Used for reporting, data monitoring, ad hoc analysis, etc. at LINE