TensorFlow on Spark: A Deep Dive into Distributed Deep Learning - Evans Ye
Deep learning has become the de facto standard for data scientists building data products, especially for text- and image-specific problems. With GPUs, deep learning can achieve a 10-100x performance improvement over traditional CPU processing. That makes a huge difference and can sometimes turn a business project from infeasible to feasible.
In this talk, we'll dive deep into how Verizon Media (Yahoo) tackles the problem of distributed deep learning. First, we'll give an overview of Verizon Media's open-sourced solution, TensorFlowOnSpark. We'll also walk through several distributed GPU training solutions and the differences between their system architectures. Second, we'll present a more lightweight DL-on-Spark solution built by the team I lead, focused on usability, productivity, and flexibility. The solution uses several advanced PySpark features and is built around PySpark's developer-friendly characteristics to make distributed DL as easy as ever for data scientists.
Getting involved in world class software engineering tips and tricks to join ... - Evans Ye
Trend Micro has been involved in Hadoop-related Apache open source projects for a long time. So far, we have contributed to projects such as Hadoop, HBase, Pig, and Bigtop. In this talk, I'll share some features we developed and our experience joining the Apache community. Specifically, the talk is composed of the following sections:
• My development in Apache Bigtop
• Tips and tricks for joining the community
• Apache Bigtop status quo
• Feature preview of recent development: Docker-based Hadoop provisioning
Let's make some contributions to open source projects and build up your personal influence in the digital world!
Edge to AI: Analytics from Edge to Cloud with Efficient Movement of Machine Data - Timothy Spann
This is my talk from DataWorks Summit Barcelona at 2pm on Thursday March 21, 2019.
https://dataworkssummit.com/barcelona-2019/session/edge-to-ai-analytics-from-edge-to-cloud-with-efficient-movement-of-machine-data/
Timothy Spann
Senior Solutions Engineer
Cloudera; formerly Hortonworks and Pivotal.
It shows how to run AI on edge devices, in NiFi flows and in CDSW.
With Dask and Numba, you can write NumPy-like and Pandas-like code and have it run very fast on multi-core systems as well as at scale on many-node clusters.
Python in the Hadoop Ecosystem (Rock Health presentation) - Uri Laserson
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning - DataWorks Summit
Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is, big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don’t play well together due to the disparity between how big data jobs are executed and how deep learning jobs are executed.
Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. Users can tell Spark whether each stage of the pipeline should run in MapReduce mode or barrier mode, which makes it easy to embed distributed deep learning training as a Spark stage and simplify the training workflow. In this talk, I will demonstrate, step by step, how to build a real-world pipeline that combines data processing with Spark and deep learning training with TensorFlow. I will also share best practices and hands-on experience to show the power of this new feature, and open up further discussion on the topic.
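The key property of barrier mode is that all tasks in a stage start together and can synchronize at a rendezvous point, the way distributed training steps must. Outside Spark, the same semantics can be sketched with Python's standard-library `threading.Barrier` (worker logic here is a hypothetical illustration, not Spark API):

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
results = []
lock = threading.Lock()

def worker(rank):
    # Phase 1: independent work (analogous to per-task data loading).
    partial = rank * rank
    # Every worker blocks here until all peers arrive, mirroring the
    # all-or-nothing synchronization Spark's barrier mode guarantees.
    barrier.wait()
    # Phase 2: coordinated work (analogous to synchronized training steps).
    with lock:
        results.append(partial)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 4, 9]
```

In real PySpark code the entry point is `rdd.barrier().mapPartitions(...)`, with `BarrierTaskContext` providing the `barrier()` rendezvous inside each task.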
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it... - DataWorks Summit
DeepLearning4J (DL4J) is a powerful open-source distributed framework that brings deep learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure, and Kotlin programmers). It can be used on distributed GPUs and CPUs, and it is integrated with Hadoop and Apache Spark. ND4J is an open-source, distributed, GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models with DL4J, ND4J, and Spark is a powerful combination, but the overall cluster configuration can present some unexpected issues that compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present best practices to prevent them. The presented use cases will cover DL4J and ND4J on different Spark deployment modes (standalone, YARN, Kubernetes). The reference programming language for the code examples is Scala, but no prior Scala knowledge is required to understand the presented topics.
Apache Bigtop has created the de-facto standard in how Hadoop-based stacks are developed, delivered, and managed. We are at it again! The track will present the composition of the next generation of in-memory computing stack that is completely built out of open-source components. The next generation of the Apache data processing stack will focus on in-memory and transactional processing of large amounts of data. We will also be talking about performance benefits that legacy data-processing software based on MapReduce, Hive, and similar, can derive from in-memory computing. This session will discuss and analyze the benefits of practicing Fast Data in the open.
Scott Callaghan from the Southern California Earthquake Center presented this deck in a recent Blue Waters Webinar.
"I will present an overview of scientific workflows. I'll discuss what the community means by "workflows" and what elements make up a workflow. We'll talk about common problems that users might be facing, such as automation, job management, data staging, resource provisioning, and provenance tracking, and explain how workflow tools can help address these challenges. I'll present a brief example from my own work with a series of seismic codes showing how using workflow tools can improve scientific applications. I'll finish with an overview of high-level workflow concepts, with an aim to preparing users to get the most out of discussions of specific workflow tools and identify which tools would be best for them."
Watch the video: http://wp.me/p3RLHQ-gtH
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
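At their core, the workflow tools discussed above execute jobs in dependency order: a job runs only after its prerequisites finish. A minimal sketch in Python (the job names, echoing the seismic example, are hypothetical):

```python
# Minimal workflow engine: run jobs in dependency order via depth-first
# resolution of prerequisites. Assumes the dependency graph is acyclic.
def run_workflow(jobs, deps):
    """jobs: {name: callable}; deps: {name: [prerequisite names]}."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for prereq in deps.get(name, []):
            run(prereq)          # ensure prerequisites finish first
        jobs[name]()             # execute the job itself
        done.add(name)
        order.append(name)

    for name in jobs:
        run(name)
    return order

log = []
jobs = {
    "stage_data":   lambda: log.append("staged"),
    "run_seismic":  lambda: log.append("simulated"),
    "post_process": lambda: log.append("processed"),
}
deps = {"run_seismic": ["stage_data"], "post_process": ["run_seismic"]}
print(run_workflow(jobs, deps))  # ['stage_data', 'run_seismic', 'post_process']
```

Production workflow systems add exactly the concerns the talk lists on top of this core: retries and job management, data staging between steps, resource provisioning, and provenance tracking.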
Date: 2018-02-10, Taiwan Data Engineering Association 2018 Q1 Technical Workshop
Topic: Building Full Stack Monitor and Notification with Prometheus
As a hybrid cloud operator, are you tired of collecting monitoring metrics from different monitoring services? As a developer, do you need historical application and infrastructure metrics to debug or improve application performance? In this talk, I'll first explain, starting from the motivation, why we should build a full-stack monitoring and alerting platform with Prometheus and Grafana. I'll share my experience over the past quarter monitoring network devices, physical machines, virtual machines, Docker containers, middleware (e.g. Apache Cassandra, Apache Kafka, CNCF Fluentd), and application metrics for data pipelines. Since a real company environment cannot be shown, I'll demonstrate an end-to-end data pipeline dashboard built with Docker Compose, and introduce the different kinds of Prometheus exporters used for the different monitoring targets.
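Whatever the target, every exporter ultimately serves plain text in the Prometheus exposition format on a `/metrics` endpoint. A minimal rendering sketch (the metric names and labels are illustrative, not from any specific exporter):

```python
def render_metrics(metrics):
    """Render (name, labels, value) samples in the Prometheus text format:
    one 'name{label="value",...} value' line per sample."""
    lines = []
    for name, labels, value in metrics:
        if labels:
            # Labels are sorted so output is deterministic.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

sample = [
    ("node_cpu_seconds_total", {"mode": "idle", "cpu": "0"}, 12345.6),
    ("kafka_consumer_lag", {"topic": "events"}, 42),
]
print(render_metrics(sample))
```

In practice you would use an official client library (e.g. `prometheus_client` for Python) rather than formatting by hand; the sketch only shows why Prometheus can scrape network gear, VMs, containers, and middleware uniformly: they all speak this one text format.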
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis... - Spark Summit
Today there are several compliance use cases — archiving, e-discovery, supervision + surveillance, to name a few — that appear naturally suited as Hadoop workloads but haven’t seen wide adoption. In this talk, we’ll discuss common limitations, how Apache Spark helps, and propose some new blueprints as to how to modernize this architecture and disrupt existing solutions. Additionally, we’ll discuss the rising role of Apache Spark in this ecosystem; leveraging machine learning and advanced analytics in a space that has traditionally been restricted to fairly rote reporting.
Using Anaconda to light up dark data. My talk given to the Berkeley Institute of Data Science describing Anaconda and the Blaze ecosystem for bringing a virtual analytical database to your data.
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
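A chaos experiment, in miniature: inject failures into a dependency and verify the system's steady state still holds. A toy sketch (the flaky service and retry policy are hypothetical illustrations):

```python
import random

def flaky_service(fail_rate, rng):
    # Simulated dependency that fails randomly, standing in for the
    # turbulent conditions a chaos experiment deliberately injects.
    if rng.random() < fail_rate:
        raise ConnectionError("injected failure")
    return "ok"

def call_with_retry(fail_rate, rng, retries=5):
    # The resilience mechanism under test: bounded retries, then degrade.
    for _ in range(retries):
        try:
            return flaky_service(fail_rate, rng)
        except ConnectionError:
            continue
    return "degraded"

rng = random.Random(42)  # seeded so the experiment is reproducible
outcomes = [call_with_retry(0.3, rng) for _ in range(100)]
# Steady-state hypothesis: despite a 30% per-call failure rate,
# the vast majority of calls still succeed.
print(outcomes.count("ok"))
```

The real discipline does this against production-like systems (killing instances, adding latency, partitioning networks) and treats the steady-state check as the experiment's pass/fail criterion.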
Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. In parallel, however, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data-savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and TensorFlow for hyperparameter tuning
* Leveraging Spark and TensorFlow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for Spark and deep learning, with and without GPUs
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
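The hyperparameter-tuning pattern in the first bullet has a simple shape: the driver fans independent training trials out to workers and collects the results. It can be sketched framework-free with `concurrent.futures`; `train_model` here is a hypothetical stand-in for a real TensorFlow training run:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_model(params):
    # Hypothetical stand-in for a TensorFlow training run: it returns a
    # fake "validation loss" computed from the hyperparameters alone.
    lr, batch = params
    return (lr - 0.01) ** 2 + abs(batch - 64) / 1000.0

grid = list(product([0.001, 0.01, 0.1], [32, 64, 128]))
# Each grid point trains independently, so the search parallelizes
# trivially; threads here play the role Spark executors play at scale.
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(train_model, grid))
best = grid[losses.index(min(losses))]
print(best)  # (0.01, 64)
```

With Spark, the same loop becomes a `parallelize(grid).map(train_model)` job, which is why tuning is usually the easiest place to combine the two platforms: no communication between trials is needed.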
The only way to get where we need to be in security analysis is to use Security Intelligence. This means working harder and understanding the big picture of your data.
Enterprise Approach towards Cost Savings and Enterprise Agility - NUS-ISS
Presented by Mr Poon See Hong, Deputy Director (Planning), Police Logistics Department, Singapore Police Force, at our 14th Architecture Community of Practice Forum on 21 Jul 2016.
Big Data Analytics (BDA) is rapidly turning out to be a significant global enterprise need. It aims to facilitate the storage, querying and analysis of enterprise big data, which is getting more complicated and time-consuming with traditional database technologies. Apache Hadoop is a well-known Open-source BDA enterprise solution which is seeing an annual application growth rate of 60% globally.
With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that allows organizations to efficiently rein in their big data business transactions. Hadoop is uniquely capable of storing, aggregating, querying and analyzing big data sources into formats that fuel new business insights. Organizations that embrace solution architectures focused on maximizing data-driven insights will put themselves in a position to drive more business, enhance productivity, maintain competitive edge or discover new and lucrative business opportunities. Over the coming years, Hadoop could be in a position to process more than half the world’s data.
To educate organizations about how best to leverage Apache Hadoop as a key component of their enterprise big data architecture, Innovative Management Services is pleased to host the 1st annual Open-BDA Hadoop Summit 2014 which is scheduled to be held on 18th & 19th November, 2014 at Marriott Hotel, Karachi.
Demystifying Big Data and Data Science
• An overview of the shift to Data Science platforms
• The 3 critical components of a Data Science platform
• Industries that are most likely to get disrupted and shift to Data Science
• Characteristics of firms that get left behind the Data Science wave
• Factors that push an industry towards Data Science
• A brief overview of aspects of platform architecture beyond technology
Big Data, Big Content, and Aligning Your Storage Strategy - Hitachi Vantara
Fred Oh's presentation for SNW Spring, Monday 4/2/12, 1:00–1:45PM
Unstructured data growth is in an explosive state and shows no signs of slowing down. Costs continue to rise along with new regulations mandating longer data retention. Moreover, disparate silos, multivendor storage assets, and less than optimal use of existing assets have all contributed to "accidental architectures." And while these can be key drivers for organizations to explore incremental, innovative solutions to their data challenges, they may provide only short-term gain. Join us for this session as we outline the business benefits of a truly unified, integrated platform that manages all block, file, and object data and lets enterprises make the most of their storage resources. We explore the benefits of an integrated approach to multiprotocol file sharing, intelligent file tiering, federated search, and active archiving; how to simplify and reduce the need for backup without the risk of losing availability; and the economic benefits of an integrated architecture approach that lowers TCSO by 35% or more.
Are you excited to learn Big Data technologies? Do you feel that the internet, though loaded with free material, is too complicated for a newbie?
Many things can go wrong when learning a new technology. Free internet material can sometimes be a can of worms for a beginner, so training is advised for a jump start.
Open-BDA Big Data Hadoop Developer Training, which will be held on 11th & 12th May 2015 at the Marriott Hotel Karachi, will cover everything you need to know to start a career in Hadoop technology and achieve expertise to a level where you can take the certification exams from MapR, Cloudera, and Hortonworks with confidence. You can start as a beginner, and this course will help you become a certified professional.
Building Hadoop Data Applications with Kite by Tom White - The Hive
With a such a large number of components in the Hadoop ecosystem, writing Hadoop applications can be a big challenge for newcomers. In this talk Tom looks at best practices for building data applications that run on Hadoop, and introduces the Kite SDK, an open source project created at Cloudera with the goal of simplifying Hadoop application development by codifying many of these best practices.
Meet with Tom White:
Tom White is one of the foremost experts on Hadoop. He has been an Apache Hadoop committer since February 2007, and is a Member of the Apache Software Foundation. Tom is a software engineer at Cloudera, where he has worked, since its foundation, on the core distributions from Apache and Cloudera. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O’Reilly, java.net and IBM’s developerWorks, and has spoken at many conferences, including ApacheCon and OSCON. Tom has a B.A. in mathematics from the University of Cambridge and an M.A. in philosophy of science from the University of Leeds, UK. He currently lives in Wales with his family.
To Serve and Protect: Making Sense of Hadoop Security - Inside Analysis
The Briefing Room with Dr. Robin Bloor and HP Security Voltage
Live Webcast September 22, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=45ece7082b1d7c2cc8179bc7a1a69ea5
Hadoop is rapidly becoming a development platform and dominant server environment, and organizations are keen to take advantage of its massively scalable – and relatively inexpensive – resources. It is not, however, without its limitations, and it often requires a contingent of complementary components in order to behave within an information architecture. One area often overlooked is security, a factor that, if not considered from the onset, can insert great risk when putting sensitive data in Hadoop.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses how security was never a design point for Hadoop and what organizations can do about it. He’ll be briefed by Sudeep Venkatesh of HP Security Voltage, who will explain the intricacies surrounding a secure Hadoop implementation. He will show how techniques like format-preserving and partial-field encryption can allow for analytics over protected data, with zero performance impact.
Visit InsideAnalysis.com for more information.
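Format-preserving encryption, mentioned above, keeps the shape of the data (digits stay digits, separators and length are unchanged) so validation and analytics keep working over protected values. Real systems use standardized, vetted schemes such as NIST's FF1; the toy below is NOT secure and only illustrates the format-preserving property with a keyed digit shift:

```python
def fpe_toy_encrypt(value, key):
    # Toy illustration only, not cryptographically secure: shift each digit
    # by the corresponding key digit (mod 10); non-digits (dashes, spaces)
    # pass through unchanged, preserving the original format.
    out, k = [], 0
    for ch in value:
        if ch.isdigit():
            out.append(str((int(ch) + key[k % len(key)]) % 10))
            k += 1
        else:
            out.append(ch)
    return "".join(out)

def fpe_toy_decrypt(value, key):
    # Decrypting is encrypting with the additive inverse of each key digit.
    return fpe_toy_encrypt(value, [(-d) % 10 for d in key])

card = "4111-1111-1111-1111"
key = [3, 1, 4, 1, 5, 9, 2, 6]
enc = fpe_toy_encrypt(card, key)
print(enc)                                # same length, dashes intact
assert fpe_toy_decrypt(enc, key) == card  # round-trips losslessly
```

Because the ciphertext still looks like a card number, downstream systems (length checks, partial-field display, joins) work on encrypted data, which is the property the talk's analytics-over-protected-data claim rests on.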
It introduces and illustrates use cases, benefits, and problems for Kerberos deployment on Hadoop, and shows how token support and TokenPreauth can help solve those problems. It also briefly introduces the Haox project, a Java client library for Kerberos.
As Hadoop becomes a critical part of Enterprise data infrastructure, securing Hadoop has become critically important. Enterprises want assurance that all their data is protected and that only authorized users have access to the relevant bits of information. In this session we will cover all aspects of Hadoop security including authentication, authorization, audit and data protection. We will also provide demonstration and detailed instructions for implementing comprehensive Hadoop security.
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
Apache Spark is one of the most exciting, active, and talked about ASF projects today, but how should Spring developers and enterprise architects view it? Is it the second coming of the Bean spec, or just another shiny distraction? This talk will introduce Spark and its core concepts, the ecosystem of services on top of it, types of problems it can solve, similarities and differences from Hadoop, integration with Spring XD, deployment topologies, and an exploration of uses in enterprise. Concepts will be illustrated with several demos covering: the programming model with Spring/Java8, development experience, “realistic” infrastructure simulation with local virtual deployments, and Spark cluster monitoring tools.
Hot Technologies of 2013 with Robin Bloor, Rick Sherman and IBM
Live Webcast June 19, 2013
http://www.insideanalysis.com
The promise of Hadoop can be seen in all kinds of ways -- the proliferation of open source projects; the virtually limitless applications of Big Data; the sheer number of vendors getting involved. But the real value only comes from a mature environment, and that's Hadoop 2.0. What are the component parts of a robust solution? How are today's cutting-edge organizations leveraging the power of Big Data?
Register for this episode of Hot Technologies to hear veteran analysts Dr. Robin Bloor of The Bloor Group and Rick Sherman of Athena IT Solutions offer perspective on how the Hadoop movement is shaping up. Larry Weber of IBM will then offer his take on the tools and architecture necessary to tackle the new challenges posed by Big Data. He'll discuss IBM's latest big data offerings, including IBM InfoSphere BigInsights, IBM InfoSphere Streams, and IBM InfoSphere Data Explorer, as well as IBM's vision for simplifying an organization's big data journey.
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
Help! I inherited a Drupal Site! - DrupalCamp Atlanta 2016 - Paul McKibben
You have found yourself newly-responsible for administering and updating a Drupal site created by somebody else, and you’re struggling. Maybe you’re new to Drupal and you’ve been thrown into the fire. Or maybe you’re experienced with Drupal but the site creator used an unfamiliar approach. Or even worse, perhaps the site was not built according to best practices, and you need to dig deep to figure out how it works and keep it updated. Whatever your situation, this presentation has something for you.
Aiming for automatic updates - Drupal Dev Days Lisbon 2018 - hernanibf
Drupal's recent security updates resulted in many hours of work for the different professionals involved in maintaining Drupal websites, from developers to operations teams.
The new Drupal 8 release cycle also requires organisations to spend more time keeping their websites on the latest minor core release, so that their sites stay updated and ready to receive new features and security updates.
Nevertheless, even with this increasing effort, we still don't have an easy way to support automatic updates in Drupal core, although options are starting to appear.
In this session I will talk about the possible alternatives that can minimize the effort of automatically updating Drupal while still maintaining best practices in all the required phases.
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives - Cloudera, Inc.
This session will provide an executive overview of the Apache Hadoop ecosystem, its basic concepts, and its real-world applications. Attendees will learn how organizations worldwide are using the latest tools and strategies to harness their enterprise information to solve business problems and the types of data analysis commonly powered by Hadoop. Learn how various projects make up the Apache Hadoop ecosystem and the role each plays to improve data storage, management, interaction, and analysis. This is a valuable opportunity to gain insights into Hadoop functionality and how it can be applied to address compelling business challenges in your agency.
A FUTURE-FOCUSED DIGITAL PLATFORM WITH DRUPAL 8Phase2
https://www.youtube.com/watch?v=NCx0fx-FWSc
Breaking News: Al Jazeera Builds Future-focused Digital Platform with Drupal 8
Sep 28, 2016 at DrupalCon Dublin
This just in: Al Jazeera Media Network, a leading provider in news and media broadcasting, is investing in its future by building a global, multi-lingual, unified CMS platform to streamline the creation and personalized delivery of news on the newly released Drupal 8 platform. This story is still unfolding!
For a global media network like Al Jazeera, Drupal 8 provides the perfect base for internationalization, future growth, and flexibility. Al Jazeera required a platform that could unify several different content streams and support a complicated editorial workflow, allowing network wide collaboration and search.
In this talk, leaders from the Al Jazeera digital project will go “behind-the-scenes” of the network’s next generation publishing platform. Hear from the Al Jazeera Product Managers and Platform Experts about how the content needs driving the media business can map to the underpinnings of a unified publishing platform. We will explore the technical advantages of Drupal 8, as well as the digital strategy that informed the endeavor. You’ll learn:
● Why Al Jazeera Media Network decided to invest in Drupal 8 as an early adopter
● How to use Deploy, Multi-version, and Replication modules to support an enterprise content repository
● The implications of starting with Lightning as a base distribution
● How Al Jazeera Media Network transformed its editorial workflow with Drupal 8 tools
For anyone working in the digital publishing industry or considering using Drupal 8 for a platform, this session is a must-see!
This presentation answers many of your questions about PostgreSQL and the Red Hat Cluster Suite.
It reviews how you can create failover/standby capabilities with the following activities:
General PostgreSQL clustering options
Overview of Red Hat Cluster Service
Identification of candidate databases for clustering
Identification of hardware for clustering
Analysis of uptime requirements and data latency
Implementation of clustering
Testing of clustering
PostgreSQL installation tips for RHCS
Gradle is an open-source build automation tool focused on flexibility, build reproducibility and performance. Over the years, this tool has evolved and introduced new concepts and features around dependency management, publication and other aspects on build and release of artifacts for the Java platform.
Keeping up to date with all these features across several projects can be challenging. How do you make sure that all your projects can be upgraded to the latest version of Gradle? What if you have thousands of projects and hundreds of engineers? How can you abstract common tasks for them and make sure that new releases work as expected?
At Netflix, we built Nebula, a collection of Gradle plugins that helps engineers remove boilerplate in Gradle build files, and makes building software the Netflix way easy. This reduces the cognitive load on developers, allowing them to focus on writing code.
In this talk, I’ll share with you our philosophy on how to build JVM artifacts and the pieces that help us boost the productivity of engineers at Netflix. I’ll talk about:
- What is Nebula
- What are the common problems we face and try to solve
- How we distribute it to every JVM engineer
- How we ensure that Nebula/Gradle changes do not break builds so we can ship new features with confidence at Netflix
Similar to Building hadoop based big data environment (20)
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
In the past, when Hadoop was born, the big data world was focused on how to build systems that scale. Now the world has evolved. HBase hits 2.0, Cassandra hits 3.0, Hive hits 3.0, etc. When scalability is conquered, what's next? That’s right, usability comes into play. If we look back into the history, NoSQL really just uses a divide-and-conquer mechanism to tackle big data problems by trading off SQL capabilities. But once the big data problem is solved, we see more and more NoSQL and data processing engines start to build up SQL or SQL-like interfaces. Therefore, a generic SQL engine that provides core SQL capabilities such as query parsing, relational algebra, and query optimization starts to shine.
In this talk, I'll walk you through the architecture, functionality, and design concept of Apache Calcite. Notice that Calcite itself is not a database, but many well-known systems already incorporate Calcite as a library. For instance, Hive, Drill, Druid, Phoenix, Apex, Flink, Storm, Samza, and more. To better illustrate how Calcite works, I'll choose some of the systems and describe how they adopt Calcite and which part is enhanced by Calcite. Furthermore, I'll talk about several features that Calcite provides such as query optimization, heterogeneous data sources, materialized views, and Stream SQL. From the user's perspective, knowing better how these systems work behind the scenes equips you with more knowledge to choose a system that ultimately suits your needs.
The Apache Way: A Proven Way Toward SuccessEvans Ye
With innumerable successful Apache projects dominating the big data world, the working model of Apache communities clearly deserves a study. In this talk, I'll walk you through how Apache communities and the Apache Software Foundation generally work. The philosophy behind it all is the so-called "The Apache Way".
For audience members who are engineers, I'll share with you why you should be part of the Apache family, how to do it, and what you can get from it. Moreover, I'll cover this with some actionable tips, closing with some career advice. For those who are managers or at the CXO level, I'll talk about some aspects of building an engineering culture, which can ultimately pace your team and business toward success.
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
You definitely have heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? It’s very similar to SMACK, except the “D" stands for Docker. While SMACK is an enterprise-scale, multi-tenant solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll talk about the advantages of the SDACK architecture, and how TrendMicro uses the SDACK architecture to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workload.
2) The data pipeline built based on Akka Stream which is flexible, scalable, and able to do self-healing.
3) The Cassandra data model designed to support time series data writes and reads.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
2. Who am I
• Evans Ye @
• Dumbo Team
• http://dumbointaiwan.blogspot.tw/
12/14/2013
Copyright 2013 Trend Micro Inc.
3. Agenda
• Building your own Hadoop version
• Hadoop Deployment
• Hadoop release engineering
• The development environment
• Bigtop puppet
4. Why build our own version
• Add your own patch at any time
– From the community's perspective, they need to take care of backward compatibility, which takes much more time and effort.
• Fetch official patches into the currently adopted version
– You may not upgrade your Hadoop version frequently, but there may be a specific need for that patch.
• Flexibility, business-needed features
8. Brute force
• git clone
• Make some changes
• Build binary tarball
How to do version control?
core-site.xml
hdfs-site.xml
mapred-site.xml
…
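One lightweight way to answer the version-control question raised above is to keep the cluster configs in a dedicated git repository. A minimal sketch; the directory name and the placeholder identity are assumptions, and on a real node you would copy the files from your Hadoop conf directory instead of creating empty ones:

```shell
#!/bin/sh
# Sketch: track Hadoop config files in their own git repo so every change
# to core-site.xml etc. is recorded. Paths and identity are hypothetical.
set -e
mkdir -p hadoop-conf
git init -q hadoop-conf
git -C hadoop-conf config user.email "ops@example.com"   # placeholder identity
git -C hadoop-conf config user.name "ops"
# seed with the config files the slide lists (empty here, for the sketch)
touch hadoop-conf/core-site.xml hadoop-conf/hdfs-site.xml hadoop-conf/mapred-site.xml
git -C hadoop-conf add .
git -C hadoop-conf commit -qm "snapshot cluster configs"
```

From there, every configuration change becomes a commit you can diff and roll back.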
10. How bigtop helps you
• Apache Hadoop App developers:
– Run pseudo-distributed Hadoop cluster to test your code on.
• Vendors:
– Build your own Apache Hadoop distribution, customized from
Apache Bigtop bits.
• Packaging, Deployment, Integration Testing
12. Build
• Build hadoop-common (see BUILDING.txt)
– hadoop-common$ mvn package -Pdist,docs,src,native -Dtar
• Prepare your src tar in bigtop
• bigtop$ make hadoop-rpm
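The build steps above can be strung together as one pipeline. A dry-run sketch that echoes each step instead of executing it, so the sequence can be reviewed first; the copy destination inside the Bigtop checkout is an assumption, so check Bigtop's docs for where it expects source tarballs:

```shell
#!/bin/sh
# Dry-run sketch of the custom Hadoop build flow: print each step rather
# than execute it. Swap echo for eval in run() to really execute.
run() { echo "+ $*"; }

run "cd hadoop-common"
run "mvn package -Pdist,docs,src,native -Dtar"             # build the tarballs
run "cp hadoop-dist/target/hadoop-*.tar.gz ../bigtop/dl/"  # hand the src tar to Bigtop (assumed path)
run "cd ../bigtop && make hadoop-rpm"                      # let Bigtop produce the RPMs
```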
17. Problems to solve
• Lots of nodes need to be configured
• Less human involvement, fewer mistakes made
• Configuration changes quite often
– adjust fair scheduler
– enable/disable short circuit
– try more performance-improvement configurations
19. What is puppet?
• An IT automation tool that helps system administrators automate many repetitive tasks
• You only need to define the desired state
20. What is Hadooppet?
• A general Hadoop cluster deployment tool based on puppet
• Kerberos / LDAP auto-configured
• A set of Hadoop / Kerberos management tools
• A set of sanity-check scripts for Trend Micro's Hadoop-related services
• Manage configuration on the puppetmaster
21. Design
• Abstract environment-specific configurations in a single configuration file
• setup.sh
– namenode_fqdns=("dev1.example.com" "dev2.example.com")
– namenode_dirs=("/name/1" "/name/2")
– namenode_heap=32g
– map_slots=5
– reduce_slots=3
– …
22. Benefits
• Can be used to set up any kind of Hadoop cluster
• When doing a major version upgrade, minimize the downtime
– hadoop1 (Namenode, Secondarynamenode) → hadoop2 (Active/Standby Namenode, Journalnodes, ZKFC)
28. give-me-vm
• PyCon 2012
– Small Python Tools for Software Release Engineering
• An automation tool to manage the VM lifecycle
• Uses the Python XenAPI
• Create temporary VMs for testing via self-service
• Destroy them when the testing is finished
29. Build auto deployment on Hadooppet
• ./give_me_vm.py
• Set up passphraseless SSH between each VM
• Set hostnames
• Install Hadooppet on the master
• Run deployment
• Run sanity checks
• ./destroy_vm.py
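The steps above can be sketched as a single dry-run script. The give_me_vm.py / destroy_vm.py names come from the slides, but the hostnames and the Hadooppet install/deploy commands are made up for illustration:

```shell
#!/bin/sh
# Dry-run sketch of the auto-deployment pipeline: print each step rather
# than execute it. Swap echo for eval in run() to really execute.
# Hostnames and Hadooppet command names are hypothetical.
run() { echo "+ $*"; }

run "./give_me_vm.py"                                     # provision temporary VMs
run "ssh-copy-id root@vm1.example.com"                    # passphraseless SSH between VMs
run "ssh root@vm1.example.com hostname vm1"               # set hostname
run "ssh root@vm1.example.com yum install -y hadooppet"   # install Hadooppet on the master
run "ssh root@vm1.example.com hadooppet-deploy"           # run deployment (assumed command)
run "ssh root@vm1.example.com hadooppet-sanity-check"     # run sanity checks (assumed command)
run "./destroy_vm.py"                                     # tear everything down when finished
```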
32. For hadoop service developers…
• Not enough Hadoop clients for every developer
• Developers cannot reach the server side while developing Hadoop-related services
• Cannot experiment with new technologies like Impala, Spark, Flume
• CI on Hadoop-related services
33. give-me-vm + Hadoop all-in-one VM
• Use Hadooppet to set up a pseudo-distributed Hadoop VM as a XenServer template
• Get a Hadoop all-in-one VM via give-me-vm
• Services integrate their CI tests with the Hadoop all-in-one VM
35. Bigtop puppet
• Bigtop also has a set of puppet scripts to deploy the Hadoop ecosystem
36. Bigtop puppet
• Preparation:
– A VM with JDK and puppet installed
– mkdir -p /data/{1,2}
– git clone https://github.com/apache/bigtop.git
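After the preparation above, the Bigtop puppet recipes can be applied on the VM. A dry-run sketch; the module and manifest paths follow the Bigtop source layout of that era and may differ in your checkout, so treat them as assumptions and check the repo's README:

```shell
#!/bin/sh
# Dry-run sketch: apply Bigtop's puppet recipes on the prepared VM.
# Print each step rather than execute it; swap echo for eval to really run.
run() { echo "+ $*"; }

run "cd bigtop"
run "puppet apply -d --modulepath=bigtop-deploy/puppet/modules bigtop-deploy/puppet/manifests/site.pp"
```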
37. Conclusion
• There are many great deployment tools out there
– Ambari, CM, ETU appliance
– Choose a suitable distribution based on your business needs
• If you want to do it yourself
– Bigtop can do the packaging for you easily
– Leverage the Bigtop puppet modules for your deployment