This document discusses Project Amaterasu, a tool for simplifying the deployment of big data applications. Amaterasu uses Mesos to deploy Spark jobs and other frameworks across clusters. It defines workflows, actions, and environments in YAML and JSON files. Workflows contain a series of actions like Spark jobs. Actions are written in Scala and interface with Amaterasu's context. Environments configure settings for different clusters. Amaterasu aims to improve collaboration and testing for big data teams through continuous integration and deployment of data pipelines.
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
DataOps expands DevOps philosophy to include data-heavy roles (data engineering & data science). DataOps uses better cross-functional collaboration for flexibility, fast time to value and an agile workflow for data-intensive applications including machine learning pipelines. (Strata Data San Jose March 2018)
Introdution to Dataops and AIOps (or MLOps)Adrien Blind
This presentation introduces the audience to the DataOps and AIOps practices. It deals with organizational & tech aspects, and provide hints to start you data journey.
Deep Dive on Amazon QuickSight - January 2017 AWS Online Tech TalksAmazon Web Services
The volume of data businesses create and process is growing every day. To get the most value out of this data, companies often invest in traditional BI tools. These tools however require investments in costly on-premises hardware and software. It takes weeks or months of data engineering time to build complex data models; not to mention the additional infrastructure needed to maintain fast query performance as data sets grow. In a nutshell, traditional BI tools are expensive and complex, and prevent companies from making analytics ubiquitous among business users. Amazon QuickSight is built from the ground up to solve these problems by bringing the scale and flexibility of the AWS Cloud and by providing a business user focused experience to business analytics.
Learning Objectives:
• Learn about the capabilities and features of Amazon QuickSight
• Learn about the benefits of Amazon QuickSight
• Learn about the different use cases
• Learn how to get started using Amazon QuickSight
• Understand how to connect to your data sources in the cloud or on-premises
• Learn how to use QuickSight’s SPICE and AutoGraph technologies to quickly spin-up charts and graphs
• Discover insights with your colleagues via Stories and become an analytics pro without any complex BI knowledge
Understanding DataOps and Its Impact on Application QualityDevOps.com
Modern day applications are data driven and data rich. The infrastructure your backends run on are a critical aspect of your environment, and require unique monitoring tools and techniques. In this webinar learn about what DataOps is, and how critical good data ops is to the integrity of your application. Intelligent APM for your data is critical to the success of modern applications. In this webinar you will learn:
The power of APM tailored for Data Operations
The importance of visibility into your data infrastructure
How AIOps makes data ops actionable
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesEric Kavanagh
Synthesis Webcast with Eric Kavanagh and Tamr
DataOps is an emerging set of practices, processes, and technologies for building and automating data pipelines to meet business needs quickly. As these pipelines become more complex and development teams grow in size, organizations need better collaboration and development processes to govern the flow of data and code from one step of the data lifecycle to the next – from data ingestion and transformation to analysis and reporting.
DataOps is not something that can be implemented all at once or in a short period of time. DataOps is a journey that requires a cultural shift. DataOps teams continuously search for new ways to cut waste, streamline steps, automate processes, increase output, and get it right the first time. The goal is to increase agility and cycle times, while reducing data defects, giving developers and business users greater confidence in data analytic output.
This webcast examines how organizations adopt DataOps practices in the field. It will review results of an Eckerson Group survey that sheds light on the rate and scope of DataOps adoption. It will also describe case studies of organizations that have successfully implemented DataOps practices, the challenges they have encountered and benefits they’ve received.
Tune into our webcast to learn:
- User perceptions of DataOps
- The rate of DataOps adoption by industry and other demographic variables
- DataOps adoption by technique and component (i.e., agile, test automation, orchestration, continuous development/continuous integration)
- Key challenges organizations face with DataOps
- Key benefits organizations experience with DataOps
- Best practices in doing DataOps
- Case studies and anecdotes of DataOps at companies
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
DataOps expands DevOps philosophy to include data-heavy roles (data engineering & data science). DataOps uses better cross-functional collaboration for flexibility, fast time to value and an agile workflow for data-intensive applications including machine learning pipelines. (Strata Data San Jose March 2018)
Introdution to Dataops and AIOps (or MLOps)Adrien Blind
This presentation introduces the audience to the DataOps and AIOps practices. It deals with organizational & tech aspects, and provide hints to start you data journey.
Deep Dive on Amazon QuickSight - January 2017 AWS Online Tech TalksAmazon Web Services
The volume of data businesses create and process is growing every day. To get the most value out of this data, companies often invest in traditional BI tools. These tools however require investments in costly on-premises hardware and software. It takes weeks or months of data engineering time to build complex data models; not to mention the additional infrastructure needed to maintain fast query performance as data sets grow. In a nutshell, traditional BI tools are expensive and complex, and prevent companies from making analytics ubiquitous among business users. Amazon QuickSight is built from the ground up to solve these problems by bringing the scale and flexibility of the AWS Cloud and by providing a business user focused experience to business analytics.
Learning Objectives:
• Learn about the capabilities and features of Amazon QuickSight
• Learn about the benefits of Amazon QuickSight
• Learn about the different use cases
• Learn how to get started using Amazon QuickSight
• Understand how to connect to your data sources in the cloud or on-premises
• Learn how to use QuickSight’s SPICE and AutoGraph technologies to quickly spin-up charts and graphs
• Discover insights with your colleagues via Stories and become an analytics pro without any complex BI knowledge
Understanding DataOps and Its Impact on Application QualityDevOps.com
Modern day applications are data driven and data rich. The infrastructure your backends run on are a critical aspect of your environment, and require unique monitoring tools and techniques. In this webinar learn about what DataOps is, and how critical good data ops is to the integrity of your application. Intelligent APM for your data is critical to the success of modern applications. In this webinar you will learn:
The power of APM tailored for Data Operations
The importance of visibility into your data infrastructure
How AIOps makes data ops actionable
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesEric Kavanagh
Synthesis Webcast with Eric Kavanagh and Tamr
DataOps is an emerging set of practices, processes, and technologies for building and automating data pipelines to meet business needs quickly. As these pipelines become more complex and development teams grow in size, organizations need better collaboration and development processes to govern the flow of data and code from one step of the data lifecycle to the next – from data ingestion and transformation to analysis and reporting.
DataOps is not something that can be implemented all at once or in a short period of time. DataOps is a journey that requires a cultural shift. DataOps teams continuously search for new ways to cut waste, streamline steps, automate processes, increase output, and get it right the first time. The goal is to increase agility and cycle times, while reducing data defects, giving developers and business users greater confidence in data analytic output.
This webcast examines how organizations adopt DataOps practices in the field. It will review results of an Eckerson Group survey that sheds light on the rate and scope of DataOps adoption. It will also describe case studies of organizations that have successfully implemented DataOps practices, the challenges they have encountered and benefits they’ve received.
Tune into our webcast to learn:
- User perceptions of DataOps
- The rate of DataOps adoption by industry and other demographic variables
- DataOps adoption by technique and component (i.e., agile, test automation, orchestration, continuous development/continuous integration)
- Key challenges organizations face with DataOps
- Key benefits organizations experience with DataOps
- Best practices in doing DataOps
- Case studies and anecdotes of DataOps at companies
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight to AWS data stores, flat files, or other third-party data sources and begin visualizing your data in minutes. We also introduce you to SPICE - a Super-fast, Parallel, In-memory, Calculation Engine in Amazon QuickSight, which performs advanced calculations and render visualizations rapidly without requiring any additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides you with smart visualizations and graphs that are optimized for your different data types, to ensure the most suitable and appropriate visualization to conduct your analysis, and how to share these visualization stories using the built-in collaboration tools.
Building an Effective Data & Analytics Operating Model A Data Modernization G...Mark Hewitt
This is the age of analytics—information resulting from the systematic analysis of data.
Insights gained from applying data and analytics to business allows large and small organizations across diverse industries—be it healthcare, retail, manufacturing, financial, or others—to identify new opportunities, improve core processes, enable continuous learning and differentiation, remain competitive, and thrive in an increasingly challenging business environment.
The key to building a data-driven practice is a Data and Analytics Operating Model (D&AOM) which enables the organization to establish standards for data governance, controls for data flows (both within and outside the organization), and adoption of appropriate technological innovations.
Success measures of a data initiative may include:
• Creating a competitive advantage by fulfilling unmet needs,
• Driving adoption and engagement of the digital experience platform (DXP),
• Delivering industry standard data and metrics, and
• Reducing the lift on service teams.
This green paper lays out the framework for building and customizing an effective data and analytics operating model.
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
In this session, Sergio covered the Lakehouse concept and how companies implement it, from data ingestion to insight. He showed how you could use Azure Data Services to speed up your Analytics project from ingesting, modelling and delivering insights to end users.
Data Architecture Strategies: Data Architecture for Digital TransformationDATAVERSITY
MDM, data quality, data architecture, and more. At the same time, combining these foundational data management approaches with other innovative techniques can help drive organizational change as well as technological transformation. This webinar will provide practical steps for creating a data foundation for effective digital transformation.
Agile & Data Modeling – How Can They Work Together?DATAVERSITY
A tenet of the Agile Manifesto is ‘Working software over comprehensive documentation’, and many have interpreted that to mean that data models are not necessary in the agile development environment. Others have seen the value of data models for achieving the other core tenets of ‘Customer Collaboration’ and ‘Responding to Change’.
This webinar will discuss how data models are being effectively used in today’s Agile development environment and the benefits that are being achieved from this approach.
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
With technological innovation and change occurring at an ever-increasing rate, it’s hard to keep track of what’s hype and what can provide practical value for your organization. Join this webinar to see the results of a recent DATAVERSITY survey on emerging trends in Data Architecture, along with practical commentary and advice from industry expert Donna Burbank.
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you or should you consider data mesh as the approach for your analytics platform? And most important - how can Snowflake help?
Given in Montreal on 14-Dec-2021
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
Modernizing to a Cloud Data ArchitectureDatabricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how elastic compute models’ benefits help one customer scale their analytics and AI workloads and best practices from their experience on a successful migration of their data and workloads to the cloud.
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesDATAVERSITY
With the aid of any number of data management and processing tools, data flows through multiple on-prem and cloud storage locations before it’s delivered to business users. As a result, IT teams — including IT Ops, DataOps, and DevOps — are often overwhelmed by the complexity of creating a reliable data pipeline that includes the automation and observability they require.
The answer to this widespread problem is a centralized data pipeline orchestration solution.
Join Stonebranch’s Scott Davis, Global Vice President and Ravi Murugesan, Sr. Solution Engineer to learn how DataOps teams orchestrate their end-to-end data pipelines with a platform approach to managing automation.
Key Learnings:
- Discover how to orchestrate data pipelines across a hybrid IT environment (on-prem and cloud)
- Find out how DataOps teams are empowered with event-based triggers for real-time data flow
- See examples of reports, dashboards, and proactive alerts designed to help you reliably keep data flowing through your business — with the observability you require
- Discover how to replace clunky legacy approaches to streaming data in a multi-cloud environment
- See what’s possible with the Stonebranch Universal Automation Center (UAC)
What is observability and how is it different from traditional monitoring? How do we effectively monitor and debug complex, elastic microservice architectures? In this interactive discussion, we’ll answer these questions. We’ll also introduce the idea of an “observability pipeline” as a way to empower teams following DevOps practices. Lastly, we’ll demo cloud-native observability tools that fit this “observability pipeline” model, including Fluentd, OpenTracing, and Jaeger.
“TODAY, COMPANIES ACROSS ALL INDUSTRIES ARE BECOMING SOFTWARE COMPANIES.”
The familiar refrain is certainly true of the new-school, born-in-the-cloud set. But it can also apply to traditional enterprises that are reinventing themselves by coupling DevOps excellence with intelligent DataOps.
Data is everywhere, and delivering trustable data to anyone who needs it has become a challenge. But innovative technologies come to the rescue: through smart semantics, metadata management, auto-profiling, faceted search and collaborative data curation there is a way to establish a Wikipedia like approach for your data. Find out how Talend will help you to operationalize more data faster and increase data usage for everyone with an Enterprise Data Catalog
Chief Data Officer: DataOps - Transformation of the Business Data EnvironmentCraig Milroy
Data is now not only considered as an Asset for Competitive Advantage; but now a Strategic Asset for Competitive Survival. ..
The Chief Data Officer will lead the transformation of the Business Data Environment to enable DataOps. . .
Leveraging DataOps will enable the timely creation of “Data Products” for the Enterprise. .
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight to AWS data stores, flat files, or other third-party data sources and begin visualizing your data in minutes. We also introduce you to SPICE - a Super-fast, Parallel, In-memory, Calculation Engine in Amazon QuickSight, which performs advanced calculations and render visualizations rapidly without requiring any additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides you with smart visualizations and graphs that are optimized for your different data types, to ensure the most suitable and appropriate visualization to conduct your analysis, and how to share these visualization stories using the built-in collaboration tools.
Building an Effective Data & Analytics Operating Model A Data Modernization G...Mark Hewitt
This is the age of analytics—information resulting from the systematic analysis of data.
Insights gained from applying data and analytics to business allows large and small organizations across diverse industries—be it healthcare, retail, manufacturing, financial, or others—to identify new opportunities, improve core processes, enable continuous learning and differentiation, remain competitive, and thrive in an increasingly challenging business environment.
The key to building a data-driven practice is a Data and Analytics Operating Model (D&AOM) which enables the organization to establish standards for data governance, controls for data flows (both within and outside the organization), and adoption of appropriate technological innovations.
Success measures of a data initiative may include:
• Creating a competitive advantage by fulfilling unmet needs,
• Driving adoption and engagement of the digital experience platform (DXP),
• Delivering industry standard data and metrics, and
• Reducing the lift on service teams.
This green paper lays out the framework for building and customizing an effective data and analytics operating model.
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
In this session, Sergio covered the Lakehouse concept and how companies implement it, from data ingestion to insight. He showed how you could use Azure Data Services to speed up your Analytics project from ingesting, modelling and delivering insights to end users.
Data Architecture Strategies: Data Architecture for Digital TransformationDATAVERSITY
MDM, data quality, data architecture, and more. At the same time, combining these foundational data management approaches with other innovative techniques can help drive organizational change as well as technological transformation. This webinar will provide practical steps for creating a data foundation for effective digital transformation.
Agile & Data Modeling – How Can They Work Together?DATAVERSITY
A tenet of the Agile Manifesto is ‘Working software over comprehensive documentation’, and many have interpreted that to mean that data models are not necessary in the agile development environment. Others have seen the value of data models for achieving the other core tenets of ‘Customer Collaboration’ and ‘Responding to Change’.
This webinar will discuss how data models are being effectively used in today’s Agile development environment and the benefits that are being achieved from this approach.
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
With technological innovation and change occurring at an ever-increasing rate, it’s hard to keep track of what’s hype and what can provide practical value for your organization. Join this webinar to see the results of a recent DATAVERSITY survey on emerging trends in Data Architecture, along with practical commentary and advice from industry expert Donna Burbank.
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you or should you consider data mesh as the approach for your analytics platform? And most important - how can Snowflake help?
Given in Montreal on 14-Dec-2021
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
Modernizing to a Cloud Data ArchitectureDatabricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how elastic compute models’ benefits help one customer scale their analytics and AI workloads and best practices from their experience on a successful migration of their data and workloads to the cloud.
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesDATAVERSITY
With the aid of any number of data management and processing tools, data flows through multiple on-prem and cloud storage locations before it’s delivered to business users. As a result, IT teams — including IT Ops, DataOps, and DevOps — are often overwhelmed by the complexity of creating a reliable data pipeline that includes the automation and observability they require.
The answer to this widespread problem is a centralized data pipeline orchestration solution.
Join Stonebranch’s Scott Davis, Global Vice President and Ravi Murugesan, Sr. Solution Engineer to learn how DataOps teams orchestrate their end-to-end data pipelines with a platform approach to managing automation.
Key Learnings:
- Discover how to orchestrate data pipelines across a hybrid IT environment (on-prem and cloud)
- Find out how DataOps teams are empowered with event-based triggers for real-time data flow
- See examples of reports, dashboards, and proactive alerts designed to help you reliably keep data flowing through your business — with the observability you require
- Discover how to replace clunky legacy approaches to streaming data in a multi-cloud environment
- See what’s possible with the Stonebranch Universal Automation Center (UAC)
What is observability and how is it different from traditional monitoring? How do we effectively monitor and debug complex, elastic microservice architectures? In this interactive discussion, we’ll answer these questions. We’ll also introduce the idea of an “observability pipeline” as a way to empower teams following DevOps practices. Lastly, we’ll demo cloud-native observability tools that fit this “observability pipeline” model, including Fluentd, OpenTracing, and Jaeger.
“TODAY, COMPANIES ACROSS ALL INDUSTRIES ARE BECOMING SOFTWARE COMPANIES.”
The familiar refrain is certainly true of the new-school, born-in-the-cloud set. But it can also apply to traditional enterprises that are reinventing themselves by coupling DevOps excellence with intelligent DataOps.
Data is everywhere, and delivering trustable data to anyone who needs it has become a challenge. But innovative technologies come to the rescue: through smart semantics, metadata management, auto-profiling, faceted search and collaborative data curation there is a way to establish a Wikipedia like approach for your data. Find out how Talend will help you to operationalize more data faster and increase data usage for everyone with an Enterprise Data Catalog
Chief Data Officer: DataOps - Transformation of the Business Data EnvironmentCraig Milroy
Data is now not only considered as an Asset for Competitive Advantage; but now a Strategic Asset for Competitive Survival. ..
The Chief Data Officer will lead the transformation of the Business Data Environment to enable DataOps. . .
Leveraging DataOps will enable the timely creation of “Data Products” for the Enterprise. .
Big data is a phenomenon brought about by rapid data growth, complex, new, and changing data types, and parallel technology advancements; it brings huge possibilities. By optimizing these enormous amounts of structured and unstructured data, CSPs are in a unique position to capture these opportunities and create new revenue streams.
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
Big data technologies are being applied to a wide variety of use cases. We will review tangible examples of machine learning, discuss an autonomous driving project and illustrate the role of MapR in next generation initiatives. More: http://info.mapr.com/WB_Machine-Learning-for-Chickens_Global_DG_17.11.02_RegistrationPage.html
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
As you walk into your office on Monday morning, before you've even had a chance to grab a cup of coffee, your CEO asks to see you. He's worried: both customer churn and fraudulent transactions have increased over the past 6 months. As Data Manager, you have 6 months to solve this problem.
As Data Manager, you know the challenges ahead:
- Multitudes of technology choices to make
- Building a team and solving the skill-set disconnect
- Data can be deceiving...
- Figuring out what the successful data product must be
Florian works in the “data” field since 01’, back when it was not yet big. He worked in successful startups in search engine, advertising, and gaming industries, holding various data or CTO roles. He started Dataiku in 2013, his first venture as a CEO, with the goal of alleviating the daily pains encountered by data teams all around.
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
Many organisations are creating groups dedicated to data. These groups have many names : Data Team, Data Labs, Analytics Teams….
But whatever the name, the success of those teams depends a lot on the quality of the data infrastructure and their ability to actually deploy data science applications in production.
In that regards a new role of “DataOps” is emerging. Similar, to Dev Ops for (Web) Dev, the Data Ops is a merge between a data engineer and a platform administrator. Well versed in cluster administration and optimisation, a data ops would have also a perspective on the quality of data quality and the relevance of predictive models.
Do you want to be a Data Ops ? We’ll discuss its role and challenges during this talk
Data Lake and the rise of the microservicesBigstep
By simply looking at structured and unstructured data, Data Lakes enable companies to understand correlations between existing and new external data - such as social media - in ways traditional Business Intelligence tools cannot.
For this you need to find out the most efficient way to store and access structured or unstructured petabyte-sized data across your entire infrastructure.
In this meetup we’ll give answers on the next questions:
1. Why would someone use a Data Lake?
2. Is it hard to build a Data Lake?
3. What are the main features that a Data Lake should bring in?
4. What’s the role of the microservices in the big data world?
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...Sri Ambati
This talk was recorded in London on Oct 30, 2018 and can be viewed here: https://youtu.be/lk2NXurrwAA
This talk highlights the integration of Driverless AI with IBM Spectrum Conductor. The integration demonstrates how you can deploy, manage, and scale out to have multiple Driverless AI instances running within your cluster per user to help maximize the efficiency and security of the cluster. The integration includes failover for Driverless AI instances, so that users can continue to work without needing to find another host to start Driverless AI on. In addition, the integration of H2O Sparkling Water with IBM Spectrum Conductor as a notebook is highlighted; as well as the benefits of running H20 Sparkling water within the cluster to maximize your cluster utilization across different workloads.For both Driverless AI and H2O Sparkling Water, a demo will be provided and a future plan for the integrations is highlighted.
Bio: Kevin Doyle is the lead architect of IBM Spectrum Conductor at IBM, where he works with customers to deploy and manage all workloads; especially Spark and deep learning workloads to on-premise clusters. Kevin has been working on distributed computing, grid, cloud, and big data for the past five years with a focus on the management and lifecycle of workloads.
DoneDeal AWS Data Analytics Platform build using AWS products: EMR, Data Pipeline, S3, Kinesis, Redshift and Tableau. Custom built ETL was written using PySpark.
Using PySpark to Process Boat Loads of DataRobert Dempsey
Learn how to use PySpark for processing massive amounts of data. Combined with the GitHub repo - https://github.com/rdempsey/pyspark-for-data-processing - this presentation will help you gain familiarity with processing data using Python and Spark.
If you're thinking about machine learning and not sure if it can help improve your business, but want to find out, set up a free 20-minute consultation with us: https://calendly.com/robertwdempsey/free-consultation
New developers and teams are now polyglot :
- they use multiple programming languages (Java, Javascript, Ruby, ...)
- they use multiple persistence store (RDBMS, NoSQL, Hadoop)
In this talk you will learn about the benefits if being polyglot: use the good language or framework for the good cause, select the good persistence for specific constraints.
This presentation will show how developer could mix the Java platform with other technologies such as NodeJS and AngularJS to build application in a more productive way. This is also the opportunity to talk about the new Command Query Responsibility Segregation (CQRS) pattern to allow developers to be more effective and deliver the proper application to the user quicker.
This presentation was delivered during Devfest Nantes 2014
The Netflix recipe for migrating your organization from building a datacenter based product to a cloud based product. First presented at the Silicon Valley Cloud Computing Meetup "Speak Cloudy to Me" on Saturday April 30th, 2011
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
In the last few years, the DevOps movement has introduced ground breaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark-ages of software development.
Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.
In this session, we will take a close look at Apache Amaterasu (Incubating) a simple and powerful framework to build and dispense pipelines. Amaterasu aims to help data engineers and data scientists to compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
We will see what Amaterasu provides today, and how it can help existing Big Data application and demo some of the new bits that are coming in the near future.
Speaker:
Yaniv Rodenski, Senior Solutions Architect, Couchbase
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
Come può .NET contribuire alla Data Science? Cosa è .NET Interactive? Cosa c'entrano i notebook? E Apache Spark? E il pythonismo? E Azure? Vediamo in questa sessione di mettere in ordine le idee.
Stay productive while slicing up the monolith Markus Eisele
DevNexus 2017
Microservices-based architectures are en-vogue. The last couple of
years we have learned how the thought-leaders implement them, and
every other week we have heard about how containers and
Platform-as-a-Service offerings make them ultimately happen.
The problem is that the developers are almost forgotten and left alone
with provisioning and continuous delivery systems, containers and
resource schedulers, and frameworks and patterns to help slice
existing monoliths. How can we get back in control and efficiently
develop them without having to provision complete production-like
environments locally, by hand?
All the new buzzwords, frameworks, and hyped tools have made us forget
ourselves—Java developers–and what it means to be productive and have
fun building systems. The problem that we set out to solve is: how can
we run real-world Microservices-based systems on our local development
machines, managing provisioning, and orchestration of potentially
hundreds of services directly from a single command line tool, without
sacrificing productivity enablers like hot code reloading and instant
turnaround time?
During this talk, you’ll experience first-hand how much fun it can be
to develop large-scale Microservices-based systems. You will learn a
lot about what it takes to fail fast and recover and truly understand
the power of a fully integrated Microservices development environment.
Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just a hype - it outperforms state-of-the-art ML algorithms. One by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus since we are using unsupervised machine learning - a domain current DL research threats step-motherly. As we can see in this demo LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. Once draw back of DeepLearning is that normally a very large labaled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict braking bearings with 10 fold confidence. All examples and all code will be made publicly available and open sources. Only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means, that QE Automations scenarios need to be detailed around actual use cases, cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, verifying which is a tedious job. Given the multiple levels of indirection, the false positives of actual defects are higher, and are generally wasteful.
At Hortonworks, we’ve designed and implemented Automated Log Analysis System - Mool, using Statistical Data Science and ML. Currently the work in progress has a batch data pipeline with a following ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures, by correlating the failing test cases, with current and historical error records, to identify root cause of errors across multiple components. The system works in unsupervised mode with no perfect model/stable builds/source-code version to refer to. In addition the system provides limited recommendations to file/open past tickets and compares run-profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Us the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted assuming data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like: "What happens if the entire datacenter fails?, or "How do I recover into a consistent state of data, so that applications can continue to run?" are not a all trivial to answer for Hadoop. Did you know that HDFS snapshots are handling open files not as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned are not (yet) important. This talk first is introducing you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
2. What Data Pipelines are Made Off
• Big Data applications:
• Ingestion
• Storage
• Processing
• Serving
• Workflows
• Machine learning
• Data Sources and Destinations
• Tests?
• Schemas??
3. Archetypes of Data Pipelines Builders
• Exploratory workloads
• Data centric
• Simple Deployment
Data People (Data Scientist/
Analysts/BI Devs)
Software Developers
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment
4. Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentality between data professionals/
engineers
• Mixture of technologies
• Data as integration point
• Often schema-less
• Lack of tools
5. Continuous Delivery
• Keep software in a production
ready state
• Test all the changes: unit,
integration
• Exercise deployments
• Faster feedback cycle
7. The case for CI/CD/DevOps in Big Data Projects
• Coordination: data engineers, analysts, business, ops
• Integrate and test critical jobs
• Complex infrastructure: multiple distributed systems
• Need to decouple cluster operation via APIs/DSLs
• DevOps team to manage cluster operations: scaling, monitoring,
deployment.
• Include CI/CD practices are part of the delivery process.
8.
9. How are these techniques
applicable to
Big Data applications?
10. What Do We Need for Deploying our apps?
• Source control system: Git, Hg, etc
• CI process to run tests and package app
• A repository to store packaged app
• A repository to store configuration
• An API/DSL to deploy to the cluster
• Mechanism to monitor the behaviour and performance of the app
11. Who are we?
Software developers with
years of Big Data experience
What do we want?
Simple and robust way to
deploy Big Data applications
How will we get it?
Write thousands of lines
of code on top of Mesos
12. Amaterasu - Simple Continually Deployed Data
Apps
• Amaterasu is the Shinto goddess of sun
• In the Japanese manga series Naruto
Amaterasu is a super-natural power in the
shape of a black flame that can only be
taken out by its Sender
• Started as a framework to reliably execute
Spark driver programs
13. Amaterasu - Simple Continually Deployed Data
Apps
• Big Data apps in Multiple Frameworks
(Currently Only Spark is Supported)
• Multiple Languages (soon)
• Workflow as YAML
• Simple to Write, easy to deploy
• Reliable execution (via Mesos)
• Multiple Environments
14. Big Data Pipeline Ops Requirements
• Support managing multiple distributed
technologies: Apache Spark, HDFS, Kafka,
Cassandra, etc.
• Treat data center as the OS while providing
resource isolation, scalability and fault tolerance.
• Ability to run multiple tasks per machine to
maximize utilization
15. Why Mesos?
• General purpose, battle tested cluster resource scheduler.
• Can run major modern Big Data systems: Hadoop, Spark,
Kafka, Cassandra
• Can deploys spark as part of the execution
• Supports scheduled and long running apps.
• Improves resource management and efficiency
• Great APIs
• DC/OS provides an even reacher environment
16. Amaterasu Repositories
• Jobs are defined in repositories
• Current implementation - git repositories
• Local directories support is planned for future release
• Repos structure
• maki.yml - The workflow definition
• src - a folder containing the actions (spark scripts, etc.) to be executed
• env - a folder containing configuration per environment
• Benefits of using git:
• Branching
• Tooling
18. Amaterasu is not a workflow engine,
it’s a deployment tool that understands that Big
Data applications are rarely deployed
independently of other Big Data applications
19. Actions DSL
• Your Scala/Future languages Spark code
• Few changes:
• Don’t create a new sc/sqlContext, use the one
in scope or access via AmaContext.sc and
AmaContext.sqlContext
• AmaContext.getDataFrame and
AmaContext.getRDD are used to access data
from previously executed actions
21. Environments
• Configuration is stored per environment
• Stored as JSON
• Contains:
• Spark master URI
• Input/output path
• Work dir
• User defined key-values
25. Future Development
• Continuous integration and test automation
• R, shell and Python support (R is already in progress)
• Extend environments to support:
• Full spark configuration (spark-defaults.conf, etc.)
• Extendable configuration model
• Better tooling
• DC/OS universe package
• Other frameworks: Flink, vowpal wabbit
• YARN?