The document discusses optimizing Spark machine learning pipelines. It describes using parallel model evaluation to speed up hyperparameter tuning: multiple candidate models are trained simultaneously, cutting the time spent on cross-validation for hyperparameter selection. The document also discusses optimizing tuning for pipeline models by treating the pipeline as a directed acyclic graph and parallelizing the fitting in breadth-first order to avoid duplicating work where possible.
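As a rough illustration of the parallel evaluation idea, the sketch below uses PySpark's CrossValidator with its parallelism parameter (available since Spark 2.3) to fit candidate models concurrently. The dataset path, column names, and grid values are invented for the example; this is a minimal sketch, not the document's own code.

```python
# Hedged sketch: parallel hyperparameter tuning with PySpark's CrossValidator.
# The parquet path and the "features"/"label" columns are illustrative only.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("parallel-tuning").getOrCreate()
train = spark.read.parquet("/data/train.parquet")  # expects "features" and "label" columns

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3,
    parallelism=4,  # evaluate up to 4 candidate models at once (Spark 2.3+)
)
model = cv.fit(train)
```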
As containerization continues to gain momentum and become a de facto standard for application deployment, challenges around containerization of big data workloads are coming to light. Great strides have been made within the open source communities towards running big data workloads in containers, but much is left to be done.
Apache Hadoop YARN is the modern distributed operating system for big data applications. It has morphed the Hadoop compute layer into a common resource-management platform that can host a wide variety of applications. At its core, YARN has a very powerful scheduler which enforces global cluster level invariants and helps sites manage user and operator expectations of elastic sharing, resource usage limits, SLAs, and more. YARN recently increased its support for Docker containerization and added a YARN service framework supporting long-running services.
In this session we will explore the emerging patterns and challenges related to containers and big data workloads, including running applications such as Apache Spark, Apache HBase, and Kubernetes in containers on YARN.
Speakers
Billie Rinaldi, Principal Software Engineer, Hortonworks
Shane Kumpf, Software Engineer, Hortonworks
Apache Flink is a popular framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma: the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program with historical data to seed its state before switching to real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
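The bootstrapping pattern itself is engine-agnostic. The toy Python sketch below is not Flink code; the event lists and timestamps are invented purely to show the idea of seeding a 7-day login count from historical data before switching to live events.

```python
# Toy illustration of the bootstrap-then-stream pattern described above.
# Plain Python, not Flink; "historical_logins" and "live_logins" stand in
# for a batch store and a real-time stream respectively.
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

def trailing_count(state: deque, event_time: datetime) -> int:
    """Record one login event and return the count of logins in the last 7 days."""
    state.append(event_time)
    while state and state[0] < event_time - WINDOW:
        state.popleft()
    return len(state)

state = deque()
now = datetime(2018, 3, 1)

# 1) Bootstrap: replay historical events to seed the window state.
historical_logins = [now - timedelta(days=d, hours=h)
                     for d in range(6, 0, -1) for h in (1, 12)]
for t in sorted(historical_logins):
    trailing_count(state, t)

# 2) Switch to "live" events: results are complete from the first live event.
live_logins = [now, now + timedelta(hours=2)]
for t in live_logins:
    print(t, trailing_count(state, t))
```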
Speaker
Gregory Fee, Principal Engineer, Lyft
Apache Spark 2.0 laid the architectural foundations of structure in Spark: unified high-level APIs, structured streaming, and the underlying performance components such as the Catalyst optimizer and Tungsten engine. Since then the Spark community has continued to build new features and fix numerous issues in the Spark 2.1 and 2.2 releases.
Continuing in that spirit, the upcoming Apache Spark 2.3 release makes similar strides, introducing new features and resolving over 1,300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features (a brief stream-stream join sketch follows the list):
• New deployment mode: Kubernetes scheduler backend
• PySpark performance and enhancements
• New structured streaming execution engine: continuous processing
• Data source v2 APIs for both structured streaming and Spark SQL
• ML on structured streaming
• Image reader
• Stable codegen engine
• Spark History Server V2
• Native ORC support
• Vectorized ORC and SQL cache readers
• Stream-stream Join
• UDF enhancements
• Various SQL enhancements
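To make one of the items above concrete, here is a minimal, hedged sketch of the new stream-stream join in PySpark. The Kafka broker, topic names, columns, and watermark durations are placeholders, not the speakers' actual example.

```python
# Hedged sketch of Spark 2.3's stream-stream join (Structured Streaming).
# Broker address, topics, column names, and watermarks are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, col, from_json
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

schema = (StructType()
          .add("ad_id", StringType())
          .add("event_time", TimestampType()))

def read_topic(topic):
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", topic)
           .load())
    return (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
               .select("e.*"))

impressions = (read_topic("impressions")
               .selectExpr("ad_id AS impression_ad_id", "event_time AS impression_time")
               .withWatermark("impression_time", "2 hours"))
clicks = (read_topic("clicks")
          .selectExpr("ad_id AS click_ad_id", "event_time AS click_time")
          .withWatermark("click_time", "3 hours"))

# Bound the join by event time so the engine can discard old state.
joined = impressions.join(
    clicks,
    expr("""click_ad_id = impression_ad_id AND
            click_time BETWEEN impression_time AND impression_time + interval 1 hour"""))

query = joined.writeStream.format("console").outputMode("append").start()
```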
Speakers
Xiao Li, Software Engineer, Databricks
Wenchen Fan, Software Engineer, Databricks
Adding structure to your streaming pipelines: moving from Spark streaming to ... (DataWorks Summit)
How do you go from a strictly typed, object-based streaming pipeline with simple operations to a structured streaming pipeline with higher-order relational operations? This is what the Data Engineering team did at GoPro to scale up the development of streaming pipelines for a rapidly growing number of devices and applications.
When big data frameworks such as Hadoop first came to exist, developers were happy because we could finally process large amounts of data without writing complex multi-threaded code, or worse yet, complicated distributed code. Unfortunately, only very simple operations such as map and reduce were available. Almost immediately, higher-level operations similar to relational operations were desired. And so Hive and dozens (hundreds?) of SQL-based big data tools became available for more developer-efficient batch processing of massive amounts of data.
In recent years, big data has moved from batch processing to stream-based processing, since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today, and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world, so that nearly every streaming framework now supports higher-level relational operations.
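For a flavor of what those relational operations look like on a stream, the hedged PySpark sketch below runs a windowed aggregate over an unbounded DataFrame; the input path, schema, and window sizes are invented and do not represent GoPro's pipelines.

```python
# Hedged sketch of relational operations in Spark 2.x Structured Streaming.
# Input directory, schema, and window/watermark durations are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("structured-iot").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("metric", DoubleType())
          .add("event_time", TimestampType()))

# A stream is just an unbounded DataFrame, so projections, filters, windowed
# aggregates, and joins replace hand-rolled RDD transformations of the DStream era.
events = (spark.readStream
          .schema(schema)
          .json("/data/iot/incoming"))

per_device = (events
              .withWatermark("event_time", "10 minutes")
              .groupBy(window(col("event_time"), "5 minutes"), col("device_id"))
              .count())

query = (per_device.writeStream
         .outputMode("update")
         .format("console")
         .start())
```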
In this talk, we will discuss in a very hands-on manner how the streaming data pipelines for GoPro devices and apps have moved from the original Spark Streaming, with its simple RDD-based operations in Spark 1.x, to Spark's Structured Streaming, with its higher-level relational operations in Spark 2.x. We will talk about the differences, the advantages, and the pain points that must be addressed in order to scale relational streaming pipelines for massive IoT streams. We will also talk about moving from “hand-built” Hadoop/Spark clusters running in the cloud to a Spark-based cloud service.
Speakers
David Winters, Big Data Architect, GoPro
Hao Zou, Senior Software Engineer, GoPro
Insights into Real-world Data Management Challenges (DataWorks Summit)
Oracle began with the belief that the foundation of IT is managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle’s Integrated Cloud is one cloud for the entire business, meeting everyone’s needs. It is about connecting people to information through tools that help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and real-time streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into the pain points we are hearing about from our customer base.
Sharing metadata across the data lake and streams (DataWorks Summit)
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g., Apache Spark, Presto, Apache Impala, and, via HCatalog, Apache Pig). As data processing expands from data in the cluster to include data in streams, the metastore needs to grow to meet these use cases as well. Work is under way in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
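As a small illustration of metastore sharing from the Spark side, a Spark session configured with Hive support sees table definitions registered in the shared metastore by other engines. The database and table names below are hypothetical.

```python
# Hedged sketch: Spark reading tables registered in a shared Hive metastore.
# The "analytics" database and "events" table are made-up names.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shared-metastore")
         .enableHiveSupport()   # point Spark at the Hive metastore (via hive-site.xml)
         .getOrCreate())

# Tables created by Hive, Presto, Pig/HCatalog, etc. are visible here.
spark.sql("SHOW TABLES IN analytics").show()
df = spark.table("analytics.events")
df.groupBy("event_type").count().show()
```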
Speaker
Alan Gates, Co-Founder, Hortonworks
Microsoft has embraced OSS by placing a big bet on Apache YARN to govern the resources of our computing clusters, and we did so by working with the community and adding many new capabilities to YARN. We now look to undertake a similar journey and build the next generation of our job execution engine on top of Apache Tez, creating a common platform for executing batch, interactive, ML, and streaming queries at exabyte scale for Microsoft's big data system, Cosmos. This requires us to push the limits of the Tez API: supporting new graph models, changing the executing DAG by dynamically adding new vertices, scheduling for interactive and streaming workloads, squeezing out all the computing power in the cluster by integrating Tez with opportunistic containers in YARN, and scaling a DAG across tens of thousands of machines. We have started out on this journey and want to share our progress and lessons learned, seek help from the community to add these new capabilities, and push Apache Tez to new levels.
SPEAKERS
Hitesh Sharma, Principal Software Engineering Manager, Big Data team, Microsoft
Anupam, Senior Software Engineer, Microsoft
Present and future of unified, portable, and efficient data processing with A... (DataWorks Summit)
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
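For a flavor of the Beam model's portability, a minimal Python pipeline looks like the hedged sketch below; the input and output paths are placeholders, and the runner is chosen through pipeline options rather than code changes.

```python
# Hedged sketch of a minimal Apache Beam pipeline in Python.
# Paths are placeholders; the runner comes from command-line flags,
# e.g. --runner=DirectRunner, SparkRunner, FlinkRunner, or DataflowRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # runner and its settings are parsed from flags

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("/data/access.log")
     | "NonEmpty" >> beam.Filter(lambda line: line.strip())
     | "ExtractUser" >> beam.Map(lambda line: (line.split()[0], 1))
     | "CountPerUser" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda user, n: f"{user}\t{n}")
     | "Write" >> beam.io.WriteToText("/data/user_counts"))
```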
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Using LLVM to accelerate processing of data in Apache Arrow (DataWorks Summit)
Most query engines follow an interpreter-based approach in which a SQL query is translated into a tree of relational algebra operations and then fed through a conventional tuple-based iterator model to execute the query. We will explore the overhead associated with this approach and how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
Generally speaking, the best case for optimal query execution performance is a hand-written query plan that does exactly what is needed by the query for the exact same data types and format. Vectorized query processing models amortize the cost of function calls. However, research has shown that hand-written code for a given query plan has the potential to outperform the optimizations associated with a vectorized query processing model.
Over the last decade, the LLVM compiler framework has seen significant development. Furthermore, the database community has realized the potential of LLVM to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediate representation (IR), which is subsequently converted into machine code for the desired target architecture.
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. The in-memory vectors map directly to the vector type in LLVM and that makes our job easier when writing the query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for some operators like FILTER and PROJECT. We will also discuss the performance benefits of LLVM-based vectorized query execution over other methods.
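The FILTER and PROJECT operators mentioned above can be illustrated on Arrow data from Python. The sketch below uses pyarrow.compute's precompiled vectorized kernels purely to show the columnar operations involved; it does not demonstrate Dremio's LLVM-based code generation itself, and the column names and values are invented.

```python
# Hedged illustration of columnar FILTER and PROJECT on Apache Arrow data.
# Uses pyarrow.compute's vectorized kernels, not Dremio's LLVM/JIT path.
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "fare": [12.5, 30.0, 7.25, 55.0],
    "distance_km": [3.1, 9.8, 1.2, 20.4],
})

# FILTER: keep rows where fare > 10, evaluated column-at-a-time.
mask = pc.greater(table["fare"], 10.0)
filtered = table.filter(mask)

# PROJECT: compute a derived column (fare per km) over the surviving rows.
fare_per_km = pc.divide(filtered["fare"], filtered["distance_km"])
projected = filtered.append_column("fare_per_km", fare_per_km)

print(projected)
```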
Speaker
Siddharth Teotia, Dremio, Software Engineer
Computing big data for evolving digital business processes demands a variety of computation techniques and engines (SQL, OLAP, time-series, graph, document store) working within a unified framework. A simple architecture for data transformations, together with security, governance, and operational administration, forms the critical foundation for enterprise production environments supporting day-to-day business processes. In this session, you will learn about best practices and the critical components needed to ensure business value, drawn from recent production deployments. Hear how existing customers are using SAP Vora and the value they have achieved so far with this in-memory engine for distributed data processing. The session provides a clear understanding of how SAP Vora and open source components like Apache Hadoop and Apache Spark offer an architecture that supports a wide variety of use cases and industries. You will also receive useful pointers on where to find development resources, test-drive demos, and general documentation.
A machine learning and data science pipeline for real companies (DataWorks Summit)
Comcast is one of the largest cable and telecommunications providers in the country, built on decades of mergers, acquisitions, and subscriber growth. The success of our company depends on keeping our customers happy and on how quickly we can pivot with changing trends and new technologies. Data abounds within our internal data centers and edge networks as well as in both private and public clouds across multiple vendors.
Within such an environment and given such challenges, how do we build AI, machine learning, and data science platforms so our company can respond to the market, predict our customers’ needs, and create new revenue-generating products that delight our customers? If you don’t happen to be our friends and colleagues at Google, Facebook, and Amazon, what technologies, strategies, and toolkits can you employ to bring together disparate data sets and quickly get them into the hands of your data scientists and then into your own production systems for use by your customers and business partners?
We’ll explore our journey and evolution and look at specific technologies and decisions that have gotten us to where we are today and demo how our platform works.
Speaker
Ray Harrison, Comcast, Enterprise Architect
Prashant Khanolkar, Comcast, Principal Architect Big Data
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform (DataWorks Summit)
An overview of USAA's decision drivers and journey to migrate from our IBM BigInsights Hadoop platform to Hortonworks Data Platform. Many obstacles challenged our ability to respond to the demands of our business's data needs and analytic capabilities. In addition to migrating 1.5 PB (500 TB usable) of data to our new HDP environment, we were introducing a new security model with Kerberos/AD integration, data governance, and many new HDP services that were unavailable on our BigInsights platform. We'll discuss the overall scope of work for this year-long journey and our approach to driving enterprise adoption of a new Hadoop platform. We still have many efforts under way to further enhance our data delivery patterns, information governance processes and procedures, and optimized consumption within our HDP platform, but we are now better positioned to provide a flexible, secure, and managed Hadoop platform with focused innovation to meet USAA's strategic initiatives.
Speaker
Lisa Coleman, USAA, Technical Architect
Robert Tucker, USAA, Software Developer & Integrator Lead
Insights into Real World Data Management Challenges (DataWorks Summit)
Data is your most valuable business asset, and it's also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data-driven organisation. From the management of data itself, to the ever-changing landscape of open source frameworks, to limited industry skills and mounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially big data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V: ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
Realizing the Promise of Portable Data Processing with Apache Beam (DataWorks Summit)
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g., Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premises), and give a glimpse of some of the challenges Beam aims to address in the future.
Speaker
Davor Bonaci, Senior Software Engineer, Google
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, Bloomberg, and FINRA, Presto has experienced unprecedented growth in popularity in both on-premises and cloud deployments over the last few years.
Inspired by the increasingly complex SQL queries run by the Presto user community, engineers at Facebook and Starburst have recently focused on cost-based query optimization. In this talk we will present the initial design and implementation of the CBO, support for connector-provided statistics, estimating selectivity, and choosing efficient query plans. Then, our detailed experimental evaluation will illustrate the performance gains for several classes of queries achieved thanks to the optimizer. Finally, we will discuss our future work enhancing the initial CBO and present the general Presto roadmap for 2018 and beyond.
Speakers
Kamil Bajda-Pawlikowski, Starburst Data, CTO & Co-Founder
Martin Traverso
Sherlock: an anomaly detection service on top of Druid (DataWorks Summit)
Sherlock is an anomaly detection service built on top of Druid. It leverages EGADS (Extensible Generic Anomaly Detection System; github.com/yahoo/egads) to detect anomalies in time-series data. Users can schedule jobs on an hourly, daily, weekly, or monthly basis, view anomaly reports from Sherlock's interface, or receive them via email.
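As a rough sketch of what a time-series anomaly check involves, the snippet below implements a plain rolling z-score in Python; it is not EGADS or Sherlock code, and the sample values are invented.

```python
# Toy rolling z-score anomaly check, for illustration only.
# EGADS and Sherlock implement far richer models; values here are invented.
from statistics import mean, stdev

def zscore_anomalies(series, window=24, threshold=3.0):
    """Flag points deviating more than `threshold` sigmas from the trailing window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append((i, series[i]))
    return anomalies

hourly_metric = [100, 102, 98, 101, 99, 103, 97, 100] * 4 + [400]  # spike at the end
print(zscore_anomalies(hourly_metric, window=24))
```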
Sherlock has four major components: time-series generation, EGADS anomaly detection, a Redis backend, and a Spark Java UI. Time-series generation involves building, validating, and issuing the Druid query, then parsing the response. The parsed Druid response is fed to the EGADS anomaly detection component, which detects anomalies and generates a report for each input time series. Sherlock uses the Redis backend to store job metadata, generated anomaly reports, a persistent job queue for scheduling, and so on; users can choose a clustered or a standalone Redis. Sherlock's user interface is built with Spark Java. The UI enables users to submit instant anomaly analyses, create and launch detection jobs, and view anomalies on a heatmap and on a graph.
Speakers
Jigarkumar Patel, Software Development Engineer I, Oath Inc.
David Servose, Software Systems Engineer, Oath
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec... (Databricks)
I will share the vision of, and the production journey toward, building enterprise shared AI-as-a-Service platforms with distributed deep learning technologies. Topics include:
1) The vision of enterprise shared AI as a Service and typical AI service use cases in the FinTech industry
2) The high-level architecture design principles for AI as a Service
3) The technical evaluation journey to choose an enterprise deep learning framework, with comparisons, such as why we chose a deep learning framework based on the Spark ecosystem
4) Production AI use cases, such as how we implemented new user-item propensity models with deep learning algorithms on Spark to improve the quality, performance, and accuracy of offer and campaign design, offer targeting, matching, and linking
5) Experiences and tips for using deep learning technologies on top of Spark, such as how we brought Intel BigDL into real production
Hadoop’s capabilities offer untapped potential for business insights, but companies often get weighed down with DIY platforms and fail to keep up with the requirements. Join this Dell EMC session, which will address this challenge with ready bundles that quickly deliver solutions for ETL offload, Single View, and IoT.
Get more value from your big data:
• Deploy big data applications faster
• Increase business agility
• Confidently deliver high performance and endless scale
• Improve IT operational efficiency
Speaker
Shawn Smith, Big Data Specialist, Dell EMC
Enabling Modern Application Architecture using Data.gov open government data (DataWorks Summit)
Big Data and the Internet of Things (IoT) have forced businesses and the Federal Government to reevaluate their existing data strategies and adopt a more modern data architecture. With the advent of the connected data platform, migrating or building data-driven applications that take advantage of data-in-motion and data-at-rest can be a daunting journey to undertake. Scaling, reusability, and achieving operational agility are just some of the common pitfalls associated with existing software architectures. How do we embrace this paradigm shift? Adopting agile methodologies and emerging development practices such as microservices and DevOps offers greater agility and operational efficiency, enabling the government to rapidly build modern data-driven applications.
During this talk and demonstration, we will show how the federal government can unleash the true power of the connected data platform with modern data-driven applications.
Connected Data Platform:
• Hortonworks DataFlow
  - Apache NiFi for capturing data at the edge of the data lake and managing the flow of data to the data platform
  - Apache Storm for complex event processing and stream processing
• Hortonworks Data Platform
  - Apache Accumulo for scalability and cell-level security
  - Apache YARN for resource management
• Modern Data-Driven Applications
  - Microservices: a software architecture practice for designing applications as suites of independently deployable services, promoting componentization, single responsibility, and scalability. Adopting a microservices mindset enables the government to be technology agnostic, using the best tool or programming language for the job.
    - Demoed REST APIs on top of Apache Accumulo (Spark-Java, AngularJS/TypeScript)
  - DevOps: a culture and practice that breaks down the silos between development and operations teams found in traditional software practices.
    - CI/CD pipelines and automated build kick-offs using containers (Docker, Jenkins)
This talk will lay out a basic environment for promoting greater agility and operational efficiency for the federal government while taking advantage of a connected data platform.
Lessons learned running a container cloud on YARN (DataWorks Summit)
Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes, YARN now supports running Docker containers alongside process containers. Coupled with the newly added support for long-running services on YARN, this allows a host of new possibilities.
In this talk, we'll present how to run a container cloud on YARN. Leveraging YARN's support for Docker and long-running services, we can allow users to easily spin up sets of Docker containers for their applications. These containers can be self-contained or wired together to form more complex applications. We will go over some of the lessons we learned from our experience handling issues such as resource management, debugging application failures, running Docker, and service discovery.
Speaker
Billie Rinaldi, Principal Software Engineer I, Hortonworks
From determining the most convenient rider pickup points to predicting the fastest routes, Uber aims to use data-driven analytics to create seamless trip experiences. Within engineering, analytics inform decision-making processes across the board.
One of the distinct challenges for Uber is analyzing geospatial big data. City locations, trips, and event information, for instance, provide insights that can improve business decisions and better serve users. Geospatial analysis is particularly challenging at big data scale, for questions such as how many rides start at a transit location or how many drivers are crossing state lines. For these analytical requests, we must achieve efficiency, usability, and scalability in order to meet user needs and business requirements.
To accomplish this, we use Hadoop, Hive, and Presto in our production environment to process the big data powering our interactive SQL engine. In this talk, we discuss our engineering effort to optimize geospatial queries in the whole Hadoop stack.
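A hedged sketch of the kind of geospatial query involved, issued to Presto from Python via PyHive, is shown below. The host, table, and column names are hypothetical and this is not Uber's actual query; it simply uses Presto's ST_Contains/ST_Point functions for a point-in-polygon test.

```python
# Hedged sketch: counting trips that start inside transit-station polygons
# using Presto's geospatial functions. Connection details, table names, and
# columns are placeholders.
from pyhive import presto

conn = presto.connect(host="presto-coordinator.example.com", port=8080, username="analyst")
cursor = conn.cursor()

cursor.execute("""
    SELECT s.station_name,
           count(*) AS pickups
    FROM trips t
    JOIN transit_stations s
      ON ST_Contains(ST_GeometryFromText(s.boundary_wkt),
                     ST_Point(t.pickup_lng, t.pickup_lat))
    WHERE t.pickup_date = DATE '2018-06-01'
    GROUP BY s.station_name
    ORDER BY pickups DESC
""")
for station, pickups in cursor.fetchall():
    print(station, pickups)
```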
Speakers
Zhenxiao Luo, Engineering Manager, Uber
Lu Niu, Sr Software Engineer, Uber
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi... (DataWorks Summit)
Back in 2014, our team set out to change the way the world exchanges and collaborates with data. Our vision was to build a single tenant environment for multiple organisations to securely share and consume data. And we did just that, leveraging multiple Hadoop technologies to help our infrastructure scale quickly and securely.
Today Data Republic’s technology delivers a trusted platform for hundreds of enterprise level companies to securely exchange, commercialise and collaborate with large datasets.
Join Head of Engineering, Juan Delard de Rigoulières and Senior Solutions Architect, Amin Abbaspour as they share key lessons from their team’s journey with Hadoop:
* How a startup leveraged a clever combination of Hadoop technologies to build a secure data exchange platform
* How Hadoop technologies helped us deliver key solutions around governance, security and controls of data and metadata
* An evaluation of the maturity and usefulness of some Hadoop technologies in our environment: Hive, HDFS, Spark, Ranger, Atlas, Knox, Kylin; we've used them all extensively.
* Our bold approach of exposing APIs directly to end users, as well as the challenges, learnings, and code we created in the process
* Learnings from the front-line: How our team coped with code changes, performance tuning, issues and solutions while building our data exchange
Whether you’re an enterprise-level business or a start-up looking to scale, this case study discussion offers behind-the-scenes lessons and key tips for using Hadoop technologies to manage data governance and collaboration in the cloud.
Speakers:
Juan Delard De Rigoulieres, Head of Engineering, Data Republic Pty Ltd
Amin Abbaspour, Senior Solutions Architect, Data Republic
Innovation in the Enterprise Rent-A-Car Data Warehouse (DataWorks Summit)
Big data adoption is a journey. Depending on the business, the process can take weeks, months, or even years. With any transformative technology, the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.
This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental, and Alamo Rent A Car brands) and its experience with big data. EHI's journey started in 2013 with Hadoop as a POC, and today the company is working to create the next-generation data warehouse in Microsoft's Azure cloud utilizing a lambda architecture.
We’ll discuss the Center of Excellence, the roles in the new world, share the things which worked well, and rant about those which didn’t.
No deep Hadoop knowledge is necessary; the talk is aimed at the architect or executive level.
DataWorks Summit 2017 - Sydney Keynote
Madhu Kochar, Vice President, Analytics Product Development and Client Success, IBM
Data science holds the promise of transforming businesses and disrupting entire industries. However, many organizations struggle to deploy and scale key technologies such as machine learning and deep learning. IBM will share how it is making data science accessible to all by simplifying the use of a range of open source technologies and data sources, including high-performing, open architectures geared for cognitive workloads.
This presentation describes some of the open source AI projects we are working on at the Center for Open Source, Data and AI Technologies (CODAIT), including the Model Asset Exchange (MAX), Fabric for Deep Learning (FfDL), and Jupyter Enterprise Gateway.
How do you analyze a petabyte of data?
The Spark Python API, or PySpark, exposes the Spark programming model to Python. Apache Spark is open source and one of the most popular big data frameworks for scaling up tasks on a cluster. It was developed to use distributed, in-memory data structures to improve data processing speeds for massive amounts of data.
We'll also look into Spark SQL, Apache Spark's module for working with structured data, and MLlib, Apache Spark's scalable machine learning library. A short end-to-end sketch appears after the list below.
What will you learn?
Perform big data analysis with PySpark
Use SQL queries with DataFrames via the Spark SQL module
Use machine learning with the MLlib library
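A minimal, hedged sketch combining the three pieces above follows; the CSV path, column names, and model choice are invented for illustration.

```python
# Hedged end-to-end sketch: PySpark DataFrames, Spark SQL, and MLlib together.
# The CSV path and columns ("age", "income", "churned") are invented; "churned"
# is assumed to be a numeric 0/1 label.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# DataFrames: load and inspect structured data.
df = spark.read.csv("/data/customers.csv", header=True, inferSchema=True)

# Spark SQL: query the same data with SQL.
df.createOrReplaceTempView("customers")
spark.sql("SELECT churned, avg(income) AS avg_income FROM customers GROUP BY churned").show()

# MLlib: assemble features and fit a simple model.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
train = assembler.transform(df).withColumnRenamed("churned", "label")
model = LogisticRegression(maxIter=20).fit(train)
print(model.coefficients)
```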
Using LLVM to accelerate processing of data in Apache ArrowDataWorks Summit
Most query engines follow an interpreter-based approach where a SQL query is translated into a tree of relational algebra operations then fed through a conventional tuple-based iterator model to execute the query. We will explore the overhead associated with this approach and how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
Generally speaking, the best case for optimal query execution performance is a hand-written query plan that does exactly what is needed by the query for the exact same data types and format. Vectorized query processing models amortize the cost of function calls. However, research has shown that hand-written code for a given query plan has the potential to outperform the optimizations associated with a vectorized query processing model.
Over the last decade, the LLVM compiler framework has seen significant development. Furthermore, the database community has realized the potential of LLVM to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediary representation (IR) which is subsequently converted into machine code for the desired target architecture.
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. The in-memory vectors map directly to the vector type in LLVM and that makes our job easier when writing the query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for some operators like FILTER and PROJECT. We will also discuss the performance benefits of LLVM-based vectorized query execution over other methods.
Speaker
Siddharth Teotia, Dremio, Software Engineer
The challenge of computing big data for evolving digital business processes demands variety of computation techniques and engines (SQL, OLAP, time-series, graph, document store), but working in unified framework. A simple architecture of data transformations while ensuring the security, governance, and operational administration are the necessary critical components for enterprise production environments supporting day-to-day business processes. In this session, you will learn about best practices & critical components to ensure business value from latest production deployments. Hear how existing customers are using SAP Vora and the value they have achieved so far with this in-memory engine for distributed data processing. The session provides you with a clear understanding how SAP Vora and open source components like Apache Hadoop and Apache Spark offer an architecture that supports a wide variety of use cases and industries. You will also receive very useful insight where to find development resources, test drive demos, and general documentation.
A machine learning and data science pipeline for real companiesDataWorks Summit
Comcast is one of the largest cable and telecommunications providers in the country built on decades of mergers, acquisitions, and subscriber growth. The success of our company depends on keeping our customers happy and how quickly we can pivot with changing trends and new technologies. Data abounds within our internal data centers and edge networks as well as both the private and public cloud across multiple vendors.
Within such an environment and given such challenges, how do we get AI, machine learning, and data science platforms built so our company can respond to the market, predict our customers’ needs and create new revenue generating products that delight our customers? If you don’t happen to be our friends and colleagues at Google, Facebook, and Amazon, what are technologies, strategies, and toolkits you can employ to bring together disparate data sets and quickly get them into the hands of your data scientists and then into your own production systems for use by your customers and business partners?
We’ll explore our journey and evolution and look at specific technologies and decisions that have gotten us to where we are today and demo how our platform works.
Speaker
Ray Harrison, Comcast, Enterprise Architect
Prashant Khanolkar, Comcast, Principal Architect Big Data
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data PlatformDataWorks Summit
Overview of USAAs decision drivers and journey to migrate from our IBM BigInsights Hadoop platform to Hortonworks Data Platform. Many obstacles challenged our ability to respond to demands of our business data needs and analytic capabilities. In addition to migrating 1.5 PB (500 TB useable) of data to our new HDP environment, we were introducing a new security model with Kerberos/AD integration, data governance, as well as many new HDP services that were unavailable within our BigInsights platform. We'll discuss overall scope of work for this year long journey and our approach in bringing enterprise adoption of a new Hadoop platform. We still have many efforts under way to further enhance our data delivery patterns, information governance process and procedures, and optimized consumption within our HDP platform, but we are now better positioned to provide a flexible, secure, and managed Hadoop platform with focused innovation to meet USAAs strategic initiatives.
Speaker
Lisa Coleman, USAA, Technical Architect
Robert Tucker, USAA, Software Developer & Integrator Lead
Insights into Real World Data Management ChallengesDataWorks Summit
Data is your most valuable business asset and it's also your biggest challenge. This challenge and opportunity means we continually face significant road blocks toward becoming a data driven organisation. From the management of data, to the bubbling open source frameworks, the limited industry skills to surmounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V - ‘Value’.Come along and join the discussion on how Oracle Big Data Cloud provides Value in the management of data and supports your move toward becoming a data driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit
The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of the Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.
Speaker
Davor Bonaci, Senior Software Engineer, Google
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, Bloomberg, and FINRA, Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments in the last few years.
Inspired by the increasingly complex SQL queries run by the Presto user community, engineers at Facebook and Starburst have recently focused on cost-based query optimization. In this talk we will present the initial design and implementation of the CBO, support for connector-provided statistics, estimating selectivity, and choosing efficient query plans. Then, our detailed experimental evaluation will illustrate the performance gains for several classes of queries achieved thanks to the optimizer. Finally, we will discuss our future work enhancing the initial CBO and present the general Presto roadmap for 2018 and beyond.
Speakers
Kamil Bajda-Pawlikowski, Starburst Data, CTO & Co-Founder
Martin Traverso
Sherlock: an anomaly detection service on top of Druid DataWorks Summit
Sherlock is an anomaly detection service built on top of Druid. It leverages EGADS (Extensible Generic Anomaly Detection System; github.com/yahoo/egads) to detect anomalies in time-series data. Users can schedule jobs on an hourly, daily, weekly, or monthly basis, view anomaly reports from Sherlock's interface, or receive them via email.
Sherlock has four major components: timeseries generation, EGADS anomaly detection, Redis backend and Spark Java UI. Timeseries generation involves building, validating, querying, parsing the Druid query. Parsed Druid response is then fed to EGADS anomaly detection component which detects and generates the anomaly reports for each input time-series data. Sherlock uses Redis backend to store jobs metadata, generated anomaly reports and persistent job queue for scheduling jobs, etc. Users can choose to have a clustered Redis or standalone Redis. Sherlock provides user interface built with Spark Java. The UI enables users to submit instant anomaly analysis, create, and launch detection jobs, view anomalies on a heatmap and on a graph. Jigarkumar Patel, Software Development Engineer I, Oath Inc. and, David Servose, Software Systems Engineer, Oath
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...Databricks
I will share the vision and the production journey of how we build enterprise shared AI As A Service platforms with distributed deep learning technologies. Including those topics:
1) The vision of Enterprise Shared AI As A Service and typical AI services use cases at FinTech industry
2) The high level architecture design principles for AI As A Service
3) The technical evaluation journey to choose an enterprise deep learning framework with comparisons, such as why we choose Deep learning framework based on Spark ecosystem
4) Share some production AI use cases, such as how we implemented new Users-Items Propensity Models with deep learning algorithms with Spark,improve the quality , performance and accuracy of offer and campaigns design, targeting offer matching and linking etc.
5) Share some experiences and tips of using deep learning technologies on top of Spark , such as how we conduct Intel BigDL into a real production.
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of the Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.
Hadoop’s capabilities offer untapped potential for business insights but companies often get weighed down with DIY platforms and fail to keep up with the requirements. Join this Dell EMC session which will address this challenge with ready bundles to quickly deliver solutions for ETL offload, Single View, & IoT.
Get more value from your big data:
• Deploy big data applications faster
• Increase business agility
• Confidently deliver high performance and endless scale
• Improve IT operational efficiency
Speaker
Shawn Smith, Big Data Specialist, Dell EMC
Enabling Modern Application Architecture using Data.gov open government dataDataWorks Summit
Big Data and the Internet of Things (IoT) have forced businesses and the Federal Government to reevaluate their existing data strategies and adopt a more modern data architecture. With the advent of the connected data platform, migrating or building data-driven applications that take advantage of data-in-motion and data-at-rest can be a daunting journey to undertake. Scaling, reusability, and achieving operational agility are just some of the common pitfalls associated with existing software architectures. How do we embrace this paradigm shift? Adopting agile methodologies and emerging development practices such as Microservices and DevOps offer greater agility and operational efficiency enabling the government to rapidly build modern data-driven applications.
During this talk and demonstration, we will show how the federal government can unleash the true power of the connected data platform with modern data-driven applications.
Connected Data Platform:
• Hortonworks DataFlow
o Using Apache NiFi for capturing data at the edge of the data lake & managing the flow of data to the data platform
o Apache Storm for complex event processing and stream processing
• Hortonworks Data Platform
o Apache Accumulo for scalability and cell-level security
o Apache YARN for resource management
• Modern Data-Driven Applications
o Microservices: a software architecture practice for designing software applications as suites of independently deployable services, promoting componentization, single responsibility & scalability. Adopting a Microservices mindset enables the government to be technology agnostic: using the best tool or programming language for the job.
♣ Demoed REST API’s on-top of Apache Accumulo. (Spark-Java, AngularJS/Typescript)
o DevOps: A culture and practice that breaks down the silos found between development and operations teams in traditional software practices.
♣ CI / CD pipelines, automated build kick-offs using containers (Docker, Jenkins)
This talk will lay out a basic environment for promoting greater agility and operational efficiency for the federal government while taking advantage of a connected data platform.
Lessons learned running a container cloud on YARNDataWorks Summit
Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes, YARN now supports running Docker containers alongside process containers. Coupled with the newly added support for long-running services on YARN, this allows a host of new possibilities.
In this talk, we'll present how to run a container cloud on YARN. Leveraging the support in YARN for Docker and long-running services, we can allow users to easily spin up sets of Docker containers for their applications. These containers can be self contained or wired up to form more complex applications. We will go over some of the lessons we learned as part of our experiences handling issues such as resource management, debugging application failures, running Docker, service discovery, etc.
Speaker
Billie Rinaldi, Principal Software Engineer I, Hortonworks
From determining the most convenient rider pickup points to predicting the fastest routes, Uber aims to use data-driven analytics to create seamless trip experiences. Within engineering, analytics inform decision-making processes across the board.
One of the distinct challenges for Uber is analyzing geospatial big data. City locations, trips, and event information, for instance, provide insights that can improve business decisions and better serve users. Geospatial data analysis is particularly challenging, especially in a big data scenario, such as computing how many rides start at a transit location, how many drivers are crossing state lines, and so on. For these analytical requests, we must achieve efficiency, usability, and scalability in order to meet user needs and business requirements.
To accomplish this, we use Hadoop, Hive, and Presto in our production environment to process the big data powering our interactive SQL engine. In this talk, we discuss our engineering effort to optimize geospatial queries in the whole Hadoop stack.
Speakers
Zhenxiao Luo, Engineering Manager, Uber
Lu Niu, Sr Software Engineer, Uber
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi... (DataWorks Summit)
Back in 2014, our team set out to change the way the world exchanges and collaborates with data. Our vision was to build a single tenant environment for multiple organisations to securely share and consume data. And we did just that, leveraging multiple Hadoop technologies to help our infrastructure scale quickly and securely.
Today Data Republic’s technology delivers a trusted platform for hundreds of enterprise level companies to securely exchange, commercialise and collaborate with large datasets.
Join Head of Engineering, Juan Delard de Rigoulières and Senior Solutions Architect, Amin Abbaspour as they share key lessons from their team’s journey with Hadoop:
* How a startup leveraged a clever combination of Hadoop technologies to build a secure data exchange platform
* How Hadoop technologies helped us deliver key solutions around governance, security and controls of data and metadata
* An evaluation of the maturity and usefulness of some Hadoop technologies in our environment (Hive, HDFS, Spark, Ranger, Atlas, Knox, Kylin): we've used them all extensively.
* Our bold approach to exposing APIs directly to end users, as well as the challenges, learnings, and code we created in the process
* Learnings from the front line: how our team coped with code changes, performance tuning, issues, and solutions while building our data exchange
Whether you’re an enterprise-level business or a start-up looking to scale, this case study discussion offers behind-the-scenes lessons and key tips for using Hadoop technologies to manage data governance and collaboration in the cloud.
Speakers:
Juan Delard De Rigoulieres, Head of Engineering, Data Republic Pty Ltd
Amin Abbaspour, Senior Solutions Architect, Data Republic
Innovation in the Enterprise Rent-A-Car Data Warehouse (DataWorks Summit)
Big Data adoption is a journey. Depending on the business, the process can take weeks, months, or even years. With any transformative technology, the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.
This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental, and Alamo Rent A Car brands) and its experience with Big Data. EHI’s journey started in 2013 with Hadoop as a POC; today the company is working to create its next-generation data warehouse in Microsoft’s Azure cloud using a lambda architecture.
We’ll discuss the Center of Excellence, the roles in the new world, share the things which worked well, and rant about those which didn’t.
No deep Hadoop knowledge is necessary; this session is aimed at the architect or executive level.
DataWorks Summit 2017 - Sydney Keynote
Madhu Kochar, Vice President, Analytics Product Development and Client Success, IBM
Data science holds the promise of transforming businesses and disrupting entire industries. However, many organizations struggle to deploy and scale key technologies such as machine learning and deep learning. IBM will share how it is making data science accessible to all by simplifying the use of a range of open source technologies and data sources, including high performing and open architectures geared for cognitive workloads.
This presentation describes some of the open source AI projects we are working on at the Center for Open Source, Data and AI Technologies (CODAIT), including the Model Asset Exchange (MAX), Fabric for Deep Learning (FfDL), and Jupyter Enterprise Gateway.
How do you analyze a Petabyte of data?
The Spark Python API or PySpark exposes the Spark programming model to Python. Apache® Spark™ is open-source and is one of the most popular Big Data frameworks for scaling up your tasks in a cluster. It was developed to utilize distributed, in-memory data structures to improve data processing speeds for massive amounts of data.
We’ll also look into Spark SQL, Apache Spark’s module for working with structured data, and MLlib, Apache Spark’s scalable machine learning library.
What will you learn?
• Perform big data analysis with PySpark
• Run SQL queries against DataFrames using the Spark SQL module
• Apply machine learning with the MLlib library (a minimal sketch covering all three follows below)
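As a rough, self-contained illustration of these three points, here is a minimal PySpark sketch; the file name, column names, and feature choices are hypothetical, not taken from the course material.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# PySpark: load a (hypothetical) CSV file into a DataFrame.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Spark SQL: register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("events")
spark.sql("SELECT label, COUNT(*) AS n FROM events GROUP BY label").show()

# MLlib: assemble numeric columns into a feature vector and fit a simple model.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression(labelCol="label").fit(assembler.transform(df))
print(model.coefficients)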
Artificial intelligence, open source, and IBM Call for Code (Luciano Resende)
In this talk we will cover some of the trends in artificial intelligence and the difficulties involved in adopting AI. We will also present some tools available as open source that can help simplify AI adoption. And we will give a brief introduction to "Call for Code," an IBM initiative to build solutions for preventing and responding to natural disasters.
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me... (Luciano Resende)
The IBM Center for Open Source, Data and AI Technologies "CODAIT" (https://developer.ibm.com/code/open/centers/codait/) works on multiple open source Data and AI projects. In this session we will introduce these projects, covering Jupyter Notebooks, reusable model and data assets, and Trusted AI, among others.
How to build containerized architectures for deep learning - Data Festival 20... (Antje Barth)
When it comes to AI, data scientists and engineers tend to focus on tools. Though the data platform that enables these tools is equally important, it’s often overlooked. In fact, 90% of the effort required for success in ML is not the algorithm – it’s the data logistics. In this workshop we will talk about common architecture blueprints for integrating AI in your data centers and how the right data platform choice can make all the difference in launching your AI use case into production! Presented at Data Festival Munich, 2019.
In the big data world, it's not always easy for Python users to move huge amounts of data around. Apache Arrow defines a common format for data interchange, while Arrow Flight, introduced in version 0.11.0, provides a means to move that data efficiently between systems. Arrow Flight is a framework for Arrow-based messaging built with gRPC. It enables data microservices where clients can produce and consume streams of Arrow data to share it over the wire. In this session, I'll give a brief overview of Arrow Flight from a Python perspective, and show that it's easy to build high performance connections when systems can talk Arrow. I'll also cover some ongoing work in using Arrow Flight to connect PySpark with TensorFlow - two systems with great Python APIs but very different underlying internal data.
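As a rough sketch of what an Arrow Flight exchange can look like from Python, here is a minimal single-process example; the port, ticket value, and dataset are illustrative assumptions, not details from the talk.

import threading
import time

import pyarrow as pa
import pyarrow.flight as flight


class DemoFlightServer(flight.FlightServerBase):
    """Serves a single in-memory Arrow table over gRPC."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self.table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    def do_get(self, context, ticket):
        # Stream the table back to the client as Arrow record batches.
        return flight.RecordBatchStream(self.table)


server = DemoFlightServer()
threading.Thread(target=server.serve, daemon=True).start()
time.sleep(1)  # give the background server a moment to start (sketch only)

# The client consumes the stream and materializes it as an Arrow table.
client = flight.FlightClient("grpc://localhost:8815")
table = client.do_get(flight.Ticket(b"demo")).read_all()
print(table.to_pandas())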
If you're like most of the world, you're in an aggressive race to implement machine learning applications and on a path toward deep learning. If you can deliver better service at a lower cost, you will be among the winners in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from petabytes to exabytes? How are you budgeting for more colossal data growth over the next decade? How do your data scientists share data today, and will it scale for 5-10 years? Do you have the appropriate security, governance, backup, and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long-term view.
The common perception of applying deep learning is that you take an open source or research model, train it on raw data, and deploy the result as a fully self-contained artifact. The reality is far more complex.
For the training phase, users face an array of challenges, including handling varied deep learning frameworks, hardware requirements, and configurations, not to mention code quality, consistency, and packaging. For the deployment phase, they face another set of challenges, ranging from custom requirements for data pre- and post-processing to inconsistencies across frameworks and a lack of standardization in serving APIs.
The goal of the IBM Developer Model Asset eXchange (MAX) is to remove these barriers to entry for developers to obtain, train, and deploy open source deep learning models for their business applications. In building the exchange, we encountered all these challenges and more.
For the training phase, we leverage the Fabric for Deep Learning (FfDL), an open source project providing framework-independent training of deep learning models on Kubernetes. For the deployment phase, MAX provides standardized container-based, fully self-contained model artifacts encompassing the end-to-end deep learning predictive pipeline.
This session introduces you to Amazon EC2 F1 instances and walks you through a typical development and deployment process, including the approved Amazon EC2 F1 C/C++ development workflow. We also discuss a number of use cases in different domains, including financial risk simulation, genomics, video processing, and big data and analytics, along with a discussion of acceleration work on top of EC2 F1.
Enabling a hardware accelerated deep learning data science experience for Apa... (DataWorks Summit)
Deep learning techniques are finding significant commercial success in a wide variety of industries. Large unstructured data sets such as images, videos, speech, and text are great for deep learning, but impose a lot of demands on computing resources. New hardware architectures such as GPUs, faster interconnects (e.g. NVLink), and RDMA-capable networking interfaces from Mellanox, available on OpenPOWER and IBM POWER systems, are enabling practical speedups for deep learning. Data scientists can intuitively incorporate deep learning capabilities on accelerated hardware using open source components such as Jupyter and Zeppelin notebooks, RStudio, Spark, Python, Docker, and Kubernetes with IBM PowerAI. Jupyter and Apache Zeppelin integrate well with Apache Spark and Hadoop using the Apache Livy project. This session will show some deep learning build and deploy steps using TensorFlow and Caffe in Docker containers running in a hardware-accelerated private cloud container service. It will also show system architectures and best practices for deployments on accelerated hardware. INDRAJIT PODDAR, Senior Technical Staff Member, IBM
DevBCN Vertex AI - Pipelines for your MLOps workflows (Márton Kodok)
In recent years, one of the biggest trends in application development has been the rise of machine learning solutions, tools, and managed platforms. Vertex AI is a managed, unified ML platform for all your AI workloads. On the MLOps side, Vertex AI Pipelines lets you adopt experiment pipelining beyond the classic build, train, eval, and deploy cycle for a model. It is engineered for data scientists and data engineers, and it’s a tremendous help for teams that don’t have DevOps or sysadmin engineers, as infrastructure management overhead has been almost completely eliminated. Based on practical examples, we will demonstrate how Vertex AI Pipelines scores high in terms of developer experience, how it fits custom ML needs, and how to analyze results. It’s a toolset for a fully fledged machine learning workflow: a sequence of steps in the model development and deployment cycle, such as data preparation/validation, model training, hyperparameter tuning, model validation, and model deployment. Vertex AI comes with all classic resources plus an ML metadata store, a fully managed feature store, and a fully managed pipelines runner. Vertex AI Pipelines is a managed serverless toolkit, which means you don't have to fiddle with infrastructure or back-end resources to run workflows.
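To make the pipelining idea concrete, here is a minimal sketch of defining, compiling, and submitting a two-step pipeline; it assumes the Kubeflow Pipelines (kfp) v2 SDK and the google-cloud-aiplatform client, and the project, region, bucket, and step logic are placeholder assumptions rather than details from the talk.

from kfp import compiler, dsl
from google.cloud import aiplatform


@dsl.component
def prepare_data(rows: int) -> int:
    # Placeholder data-preparation/validation step.
    return rows


@dsl.component
def train_model(rows: int) -> str:
    # Placeholder training step; a real step would train and export a model.
    return f"trained-on-{rows}-rows"


@dsl.pipeline(name="demo-training-pipeline")
def demo_pipeline(rows: int = 1000):
    data = prepare_data(rows=rows)
    train_model(rows=data.output)


# Compile the pipeline to a job spec and submit it to the managed runner.
compiler.Compiler().compile(pipeline_func=demo_pipeline, package_path="pipeline.json")

aiplatform.init(project="my-project", location="europe-west1",
                staging_bucket="gs://my-bucket")  # placeholder values
aiplatform.PipelineJob(
    display_name="demo-training-pipeline",
    template_path="pipeline.json",
    parameter_values={"rows": 5000},
).run()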
Productionizing Spark ML Pipelines with the Portable Format for Analytics (Nick Pentreath)
The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases.
This is particularly challenging in the case of deploying Apache Spark ML pipelines for low-latency scoring. While MLlib’s DataFrame API is powerful and elegant, it is relatively ill-suited to the needs of many real-time predictive applications, in part because it is tightly coupled with the Spark SQL runtime. In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open and standardized deployment of data science pipelines & analytic applications.
I’ll also introduce and evaluate Aardpfark, a library for exporting Spark ML pipelines to PFA.
A Look Under the Hood of H2O Driverless AI, Arno Candel - H2O World San Franc... (Sri Ambati)
This session was recorded in San Francisco on February 4th, 2019 and can be viewed here: https://youtu.be/oQfFPPUg5t8
Bio: Arno Candel is the Chief Technology Officer at H2O.ai. He is the main committer of H2O-3 and Driverless AI and has been designing and implementing high-performance machine-learning algorithms since 2012. Previously, he spent a decade in supercomputing at ETH and SLAC and collaborated with CERN on next-generation particle accelerators.
Arno holds a PhD and Masters summa cum laude in Physics from ETH Zurich, Switzerland. He was named “2014 Big Data All-Star” by Fortune Magazine and featured by ETH GLOBE in 2015. Follow him on Twitter: @ArnoCandel.
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S... (Amazon Web Services)
In this session, we cover best practices for enterprises that want to use powerful open-source technologies to simplify and scale their machine learning (ML) efforts. Learn how to use Apache Spark, the data processing and analytics engine commonly used at enterprises today, for data preparation as it unifies data at massive scale across various sources. We train models using TensorFlow, and we use MLflow to track experiment runs between multiple users within a reproducible environment. We then manage the deployment of models to production. We show you how MLflow can be used with any existing ML library and incrementally incorporated into an existing ML development process. This session is brought to you by AWS partner, Databricks.
How to deploy machine learning models into production (DataWorks Summit)
Data scientists spend a lot of time on data cleaning and munging, so that they can finally start with the fun part of their job: building models. After you have engineered the features and tested different models, you see how the prediction performance improves. However, the job is not done when you have a high performing model. The deployment of your models is a crucial step in the overall workflow and it is the point in time when your models actually become useful to your company.
In this session you will learn about various possibilities and best practices for bringing machine learning models into production environments. The goal is not only to make live prediction calls or have the models available as a REST API, but also to cover what needs to be considered to maintain them. This talk will focus on solutions with Python (flask, Cloud Foundry, Docker, and more) and well-established ML packages such as Spark MLlib, scikit-learn, and xgboost, but the concepts can be easily transferred to other languages and frameworks.
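As a minimal illustration of the REST API option mentioned above, here is a sketch of a flask scoring service for a scikit-learn model; the model file name, route, and payload shape are assumptions for the example, not the speaker's setup.

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained scikit-learn model from disk (hypothetical file name).
model = joblib.load("model.joblib")


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [[1.0, 2.0, 3.0]]}.
    payload = request.get_json(force=True)
    prediction = model.predict(payload["features"]).tolist()
    return jsonify({"prediction": prediction})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)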
Speaker
Sumit Goyal, IBM, Software Engineer
Similar to "Optimizing your SparkML pipelines using the latest features in Spark 2.3":
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python’s scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables mapped to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not always trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and because its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when investigating anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use today, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges OCLC has encountered in scaling to support the world catalog, and how they have been overcome.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber, we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates to existing data in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where this data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and rewritten to HDFS instead of being updated, which leads to duplicated data and breaks data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it needs to provide strong consistency and high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component; it is critical in allowing us to scale our jobs, which now handle more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly into the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, as well as discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.
Introducing MLflow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.), and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project, and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
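To make the "few lines of code" concrete, here is a minimal tracking sketch with MLflow's Python API; the model, parameters, and metric are illustrative, not from the talk.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 4}
    model = RandomForestClassifier(**params).fit(X, y)

    # Log parameters, a metric, and a deployable packaging of the model.
    mlflow.log_params(params)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")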
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries to help users with both batch and real-time analytics. Our data platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our data platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process, discuss the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we deep dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premises and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud, and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stock levels on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Welcome to ViralQR, your best QR code generator (ViralQR)
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, thus enhancing customer interaction and making business more fluid. We very strongly believe in the ability of QR codes to change the world for businesses in their interaction with customers and are set on making that technology accessible and usable far and wide.
Our Achievements
Ever since its inception, we have successfully served many clients by offering QR codes in their marketing, service delivery, and collection of feedback across various industries. Our platform has been recognized for its ease of use and amazing features, which helped a business to make QR codes.
Our Services
At ViralQR, here is a comprehensive suite of services that caters to your very needs:
Static QR Codes: Create free static QR codes. These QR codes are able to store significant information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, ViralQR offers a 14-day free trial, an excellent opportunity for new users to get a feel for the platform. From there, one can easily subscribe and experience the full range of dynamic QR code features. The subscription plans are not only meant for business; they are priced flexibly so that practically every business can afford to benefit from our service.
Why choose us?
ViralQR will provide services for marketing, advertising, catering, retail, and the like. The QR codes can be posted on fliers, packaging, merchandise, and banners, as well as to substitute for cash and cards in a restaurant or coffee shop. With QR codes integrated into your business, improve customer engagement and streamline operations.
Comprehensive Analytics
Subscribers of ViralQR receive detailed analytics and tracking tools in light of having a view of the core values of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
So, thank you for choosing ViralQR; we have an offer of nothing but the best in terms of QR code services to meet business diversity!
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curricula, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Call for Code inspires developers to solve pressing global problems with sustainable software solutions, delivering on their vast potential to do good. It brings together NGOs, academic institutions, enterprises, and startup developers to compete to build effective disaster mitigation solutions, with a focus on health and well-being. The International Federation of Red Cross/Red Crescent, the American Red Cross, and the United Nations Office of Human Rights combine for the Call for Code Award to elevate the profile of developers. Award winners will receive long-term support through open source foundations, financial prizes, and the opportunity to present their solution to leading VCs, and will deploy their solution through IBM's Corporate Service Corps. Developers will jump-start their project with dedicated IBM Code Patterns, combined with optional enterprise technology, to build projects over the course of three months. Judged by the world's most renowned technologists, the grand prize will be presented in October at an Award Event.
developer.ibm.com/callforcode
Arrow is an in-memory columnar data format.
The format definition is language agnostic.
There are libraries implemented in several key languages, with more coming.
Looking at Arrow from a high-level point of view, it's not just for transferring data from Java to Python.
Arrow can be a common way to bring together many different systems in the big data world that might otherwise require a lot of specialized code to talk to the JVM.
Also, an important feature of Arrow for non-Java applications is that it can read/write Parquet, which is a standard big data file format.
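For example, a round trip to Parquet from Python through Arrow is only a couple of lines; the file name and columns here are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq

# Write an Arrow table to Parquet, then read it back without going through the JVM.
table = pa.table({"word": ["spark", "arrow"], "count": [3, 5]})
pq.write_table(table, "counts.parquet")
print(pq.read_table("counts.parquet").to_pandas())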
Let’s look in detail at how a Python worker processes a UDF with and without Arrow.
Now that we have an efficient way to transfer data to and from Python, it becomes more practical to start integrating some of the great packages available in Python.
Going back to the word count example, now we can create a pandas_udf that uses the Python package spaCy for NLP to do the tokenization, which will give us better tokenization with configurable languages.
Here, the input is a pandas Series of text documents. We use pandas to apply the spaCy tokenizer to each document and return a pandas Series of string arrays.
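A sketch of that pandas_udf follows, assuming Spark 2.3-style scalar pandas UDFs and an installed spaCy English model; the model name and column names are assumptions.

import pandas as pd
import spacy
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, StringType

# Load the spaCy pipeline once per Python worker (model name is an assumption).
nlp = spacy.load("en_core_web_sm")


@pandas_udf(ArrayType(StringType()), PandasUDFType.SCALAR)
def tokenize(docs: pd.Series) -> pd.Series:
    # Input: a pandas Series of text documents.
    # Output: a pandas Series of string arrays (the tokens of each document).
    return docs.apply(lambda text: [token.text for token in nlp(text)])

# Usage: df.withColumn("tokens", tokenize(df["text"]))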
Taking this a step further, we will probably want to connect it into an ML pipeline. Currently all of the existing Spark MLlib stages are Java based, so this creates a multi-language pipeline. To make this less painful, it would be really great to have a simple way to plug in our Python code.
Trying to hack up your own solution is difficult because it can involve a lot of boilerplate code, and it's not always easy to work with the Spark ML framework.
Fortunately, there is an open source project called Sparkling ML. It is a library extension to Spark that adds additional estimators/transformers and allows you to easily write your own that will fit in nicely with the rest of Spark ML.
It also makes it easy to integrate a pure Python stage into a standard Scala-based pipeline.
Once we have defined our spaCy tokenization stage, we can use it to build a mixed-language pipeline that fits right in with the rest of Spark.
So we are then able to use Spark SQL to feed our pipeline and tune it with the existing CrossValidator.
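The following sketch shows the shape of such a pipeline fed from Spark SQL and tuned with the existing CrossValidator. It assumes an active SparkSession named spark and an existing view called training_documents; the built-in Tokenizer stands in for the spaCy-backed Python stage that Sparkling ML would supply, so treat the stage choice as a placeholder rather than the library's exact API.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Stand-in tokenizer stage: in the talk, this slot is filled by the spaCy-backed
# Python stage from Sparkling ML; the built-in Tokenizer keeps the sketch runnable.
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
hashing_tf = HashingTF(inputCol="tokens", outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# Feed the pipeline from Spark SQL and tune it with CrossValidator.
train = spark.sql("SELECT text, label FROM training_documents")  # assumed view
grid = (ParamGridBuilder()
        .addGrid(hashing_tf.numFeatures, [1 << 10, 1 << 18])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
cv_model = cv.fit(train)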