The document describes building a low-cost sample tracking system using G Suite and Jira Cloud. It discusses using current off-the-shelf technology to create a serverless solution, how low-cost solutions can accelerate academic research, and developing the minimum viable product through iterative delivery. Permission to learn new skills can help develop capabilities to address problems and move research forward.
Intro to H2O in Python - Data Science LA - Sri Ambati
Erin LeDell's presentation on Intro to H2O Machine Learning in Python at Data Science LA meetup on 1.19.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
H2O Deep Water - Making Deep Learning Accessible to Everyone - Sri Ambati
Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use and deployment. In this talk, I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models, with or without programming experience, using H2O's R/Python/Flow (Web) interfaces.
Jo-fai (or Joe) is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media in UK where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab in the US as a data science evangelist promoting products via blogging and giving talks at meetups. Joe has a background in water engineering. Before his data science journey, he was an EngD research engineer at STREAM Industrial Doctorate Centre working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specialized in data mining and constrained optimization for the utilities sector in the UK and abroad. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.
Intro to H2O Machine Learning in R at Santa Clara University - Sri Ambati
Erin LeDell's presentation on Intro to H2O Machine Learning in R at SCU
Intro to Machine Learning with H2O and Python - Denver - Sri Ambati
Presentation at Comcast Denver 03.01.16
Intro to H2O Machine Learning in Python - Galvanize Seattle - Sri Ambati
Erin LeDell presents Intro to H2O Machine Learning in Python at Galvanize Seattle, 02.02.16
Deep Learning on Apache® Spark™: Workflows and Best Practices - Jen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring supports both configuration work and the stability of long-running deep learning jobs.
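The abstract stops short of code. As a hedged illustration only, the GPU-conflict point might look like the following, using the resource-scheduling properties introduced in Spark 3.x; the property names, values, and helper are assumptions for illustration, not from the webinar:

```python
# Hypothetical Spark properties for GPU-aware scheduling (Spark 3.x
# resource scheduling; check the property names against your Spark version).
gpu_conf = {
    # Each executor claims one GPU so executors do not oversubscribe a device.
    "spark.executor.resource.gpu.amount": "1",
    # Each task claims a whole GPU, preventing two TensorFlow tasks from
    # colliding on the same device.
    "spark.task.resource.gpu.amount": "1",
    # Fewer, larger tasks tend to suit long-running deep learning jobs.
    "spark.default.parallelism": "8",
}

def to_submit_args(conf):
    """Render the properties as spark-submit --conf flags."""
    return " ".join(f"--conf {k}={v}" for k, v in sorted(conf.items()))

print(to_submit_args(gpu_conf))
```

A dictionary like this can be passed to `spark-submit` or set on a `SparkConf`; the point is that GPU exclusivity is a scheduler setting, not application code.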
RISELab: Enabling Intelligent Real-Time Decisions - Jen Aman
Spark Summit East Keynote by Ion Stoica
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
Ataas2016 - Big Data, Hadoop and MapReduce - New Age Tools for Aid to Test... - Agile Testing Alliance
Big Data - Hadoop and MapReduce - new age tools for aid to testing and QA. Big Data, with its slew of technologies and terms, has been the most talked-about area in the last couple of years. It has evolved into Big Data science and analytics, and now into IoT and automation. Testers and QA teams need not only to get used to this new digital transformation area but also to embrace the technology to their own advantage. We have experimented with and successfully used the Big Data technologies Hadoop and MapReduce in a recent testing engagement. The actual application was implemented using classic technologies such as CentOS and C++. The testing team implemented Hadoop and MapReduce to enable a quick turnaround for testing.
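The abstract does not show the jobs themselves; the following is a minimal Hadoop-Streaming-style sketch of the kind of map/reduce aggregation a test team might run over large test logs. The log format, keys, and counts are hypothetical, not taken from the engagement described:

```python
# Toy map/reduce over test-log lines: count FAILs per test suite.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (suite, 1) for every FAIL line, Hadoop Streaming style."""
    parts = line.split()
    if len(parts) >= 2 and parts[1] == "FAIL":
        yield (parts[0], 1)

def reducer(pairs):
    """Sum counts per key, as the reduce phase would."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

log = ["auth FAIL timeout", "auth PASS", "billing FAIL assertion", "auth FAIL oom"]
pairs = [kv for line in log for kv in mapper(line)]
print(dict(reducer(pairs)))  # {'auth': 2, 'billing': 1}
```

In a real Hadoop Streaming job the mapper and reducer would be separate scripts reading stdin and writing stdout, with the framework doing the sort-and-shuffle between them.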
Arno Candel, Chief Architect, H2O.ai talks about what's new in H2O including all the new advancements in the algorithms.
H2O World - H2O Deep Learning with Arno Candel - Sri Ambati
H2O World 2015
Tutorial scripts for R and Python are here:
https://github.com/h2oai/h2o-world-2015-training/tree/master/tutorials/deeplearning
Snorkel: Dark Data and Machine Learning with Christopher Ré - Jen Aman
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on GitHub and available from Snorkel.Stanford.edu.
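The abstract's key idea, writing small programs that label data instead of hand-labeling it, can be sketched in a few lines. This toy version is not the Snorkel API: it combines labeling-function votes by simple majority, whereas the real Snorkel learns each function's accuracy from the data. The labeling functions and labels below are invented for illustration:

```python
# Weak supervision in miniature. Labels: 1 = spam, 0 = not spam, None = abstain.
from collections import Counter

def lf_contains_link(text):
    return 1 if "http" in text else None

def lf_longer_message(text):
    return 0 if len(text.split()) > 5 else None

def lf_money_words(text):
    return 1 if any(w in text.lower() for w in ("prize", "winner", "$$$")) else None

LFS = [lf_contains_link, lf_longer_message, lf_money_words]

def weak_label(text):
    """Majority vote over non-abstaining labeling functions."""
    votes = [lf(text) for lf in LFS if lf(text) is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

print(weak_label("You are a winner! Claim your prize at http://spam.example"))
```

Applied over a large unlabeled corpus, these noisy votes become the training set that the abstract says replaces tedious hand-labeling.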
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310)... - Amazon Web Services
In this talk, hear about two high-performance research services developed and operated by the Computation Institute at the University of Chicago, running on AWS. Globus.org, a high-performance, reliable, robust file transfer service, has over 10,000 registered users who have moved over 25 petabytes of data using the service. The Globus service is operated entirely on AWS, leveraging Amazon EC2, Amazon EBS, Amazon S3, Amazon SES, Amazon SNS, and more. Globus Genomics is an end-to-end next-gen sequencing analysis service with state-of-the-art research data management capabilities. It uses Amazon EC2 for scaling out analysis, Amazon EBS for persistent storage, and Amazon S3 for archival storage. Attend this session to learn how to move data quickly at any scale, and how to use genomic analysis tools and pipelines for next-generation sequencers using Globus on AWS.
The Discovery Cloud: Accelerating Science via Outsourcing and Automation - Ian Foster
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford - Sri Ambati
Ken and Fonda will talk through how organizations are embracing open source machine learning and AI platforms and what strategies to use to make the transformation easier.
Deck from a talk at YOW Data in Sydney. Covers VariantSpark, a custom Apache Spark machine learning library, and GT-Scan2, which uses an AWS Lambda architecture for bioinformatics.
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny - Jo-fai Chow
Joe recently teamed up with IBM and Aginity to create a proof of concept "Moneyball" app for the IBM Think conference in Vegas. The original goal was to prove that different tools (e.g. H2O, Aginity AMP, IBM Data Science Experience, R and Shiny) could work together seamlessly for common business use-cases. Little did Joe know, the app would be used by Ari Kaplan (the real "Moneyball" guy) to validate the future performance of some baseball players. Ari recommended one player to a Major League Baseball team. The player was signed the next day with a multimillion-dollar contract. This talk is about Joe's journey to a real "Moneyball" application.
Applying Testing Techniques for Big Data and Hadoop - Mark Johnson
Testing “Big Data” can mean a big time investment; several hours are often spent just to realize you made a simple typo. You fix the typo and then wait another couple of hours for your script to, hopefully, run to completion this time. And even if the Big Data script or program ran to completion, are you sure your data analysis is correct? Getting programs to run to completion and assuring functional accuracy against the requirements are some of the biggest hidden problems in big data today.
During this overview presentation, we will first introduce unit and functional testing techniques and high-level concepts to consider in the Hadoop ecosystem. In the second half of the presentation, we will explore real testing examples using tools such as PigUnit, JUnit for UDF testing, BeeTest, and limited-test-data-set testing in Hive.
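The talk's examples use PigUnit and JUnit; the same discipline can be sketched in Python with the standard library: test a small UDF against a fixed, limited dataset before paying for a full cluster run. `normalize_country` here is a hypothetical UDF for illustration, not one from the talk:

```python
# Unit-testing a UDF locally, in the spirit of PigUnit/JUnit for Hadoop UDFs.
import unittest

def normalize_country(raw):
    """UDF under test: map free-text country names to short codes."""
    table = {"united states": "US", "usa": "US", "united kingdom": "UK"}
    return table.get(raw.strip().lower(), "UNKNOWN")

class TestNormalizeCountry(unittest.TestCase):
    def test_known_aliases(self):
        self.assertEqual(normalize_country("  USA "), "US")
        self.assertEqual(normalize_country("United Kingdom"), "UK")

    def test_unknown_value_is_flagged_not_dropped(self):
        self.assertEqual(normalize_country("Atlantis"), "UNKNOWN")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalizeCountry)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running the same assertions against a limited, representative dataset catches the "simple typo" class of failure in seconds instead of hours.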
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SF - Sri Ambati
Hank Roark's presentation at Galvanize SF, 02.23.16
The agenda for November included Microsoft Graph updates, PowerShell and the Microsoft Graph, Microsoft Graph Data Connect, and how we built Cards4.Pro, plus Q&A.
Architect’s Open-Source Guide for a Data Mesh Architecture - Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
The list of failed big data projects is long. They leave end users, data analysts, and data scientists frustrated with long lead times for changes. This case study will illustrate how to make changes to big data, models, and visualizations quickly, with high quality, using the tools teams love. We synthesize techniques from DevOps, Deming, and direct experience.
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers - Revolution Analytics
The business cases for Hadoop can be made on the tremendous operational cost savings that it affords. But why stop there? The integration of R-powered analytics in Hadoop presents a totally new value proposition. Organizations can write R code and deploy it natively in Hadoop without data movement or the need to write their own MapReduce. Bringing R-powered predictive analytics into Hadoop will accelerate Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.
Software engineering practices for the data science and machine learning life... - DataWorks Summit
With the advent of newer frameworks and toolkits, data scientists are now more productive than ever and starting to prove indispensable to enterprises. Typical organizations have large teams of data scientists who build out key analytics assets that are used on a daily basis and are an integral part of live transactions. However, quite a lot of chaos and complexity also gets introduced because of the state of the industry. Many packages used by data scientists come from open source, and even if they are well curated, there is a growing tendency to pick out cutting-edge or unstable packages and frameworks to accelerate analytics. Different data scientists may use different versions of runtimes, different Python or R versions, or even different versions of the same packages. Data scientists predominantly work on their laptops, and it becomes difficult to reproduce their environments for use by others. Since data science is now a team sport across multiple personas, involving non-practitioners, traditional application developers, execs, and IT operators, how does an enterprise create a platform for productive cross-role collaboration?
Enterprises need a very reliable and repeatable process, especially when it results in something that affects their production environments. They also require a well managed approach that enables the graduation of an asset from development through a testing and staging process to production. Given the pace of businesses nowadays, the process needs to be quite agile and flexible too—even enabling an easy path to reversing a change. Compliance and audit processes require clear lineage and history as well as approval chains.
In the traditional software engineering world, this lifecycle has been well understood and best practices have been followed for ages. But what does it mean when you have non-programmers, or users who are not trained in software engineering philosophies, or who perceive all of this as "big process" roadblocks in their daily work? How do we engage them in a productive manner and yet support enterprise requirements for reliability, tracking, and a clear continuous integration and delivery practice? In this session, the presenters will share interesting techniques based on their user research, real-life customer interviews, and productized best practices. The presenters also invite the audience to share their stories and best practices to make this a lively conversation.
Speaker
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Azure DevOps offers many tools that you can choose from to augment your DevOps practices. Whether you are delivering software on-prem or in the cloud, building OSS or commercial solutions, using .NET, Java, Swift or any other language, you should see what Azure DevOps has to offer.
This presentation anchors best practices for Enterprise Data Science based on Microsoft's "Team Data Science Process". The talk includes introducing the concepts, describing some real-world advice for project planning, and discusses typical titles of professionals who make enterprise data science successful. These techniques also apply for AI (artificial intelligence), deep learning, machine learning, and advanced analytics.
As part of this session, I will be giving an introduction to Data Engineering and Big Data. It covers up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss it live - you can click the link below to watch the video after the scheduled session.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
Leading organizations today all have data scientists and analytics teams. A key challenge is establishing cross-functional teams that can collaboratively derive insights from data and move exploratory interactive analytics into automated production systems. Boston Consulting Group, founded on quantitative decision making, guides global F500 companies in the technical and organizational structures that will provide a foundation for agility, innovation, and competitive advantage. This talk will outline key strategies for building effective cloud-native analytics teams.
Azure DevOps offers many tools that you can choose from to augment your DevOps practices. Whether you are delivering software on-prem or in the cloud, building OSS or commercial solutions, using .NET, Java, Swift or any other language, you should see what Azure DevOps has to offer.
A FAIR Approach to Publishing and Sharing Machine Learning ModelsBen Blaiszik
While there has been a significant increase in the amount of machine learning research across various domains of science, the processes to publish the results and make the resulting models and code available for reuse has been lacking. In this talk, we discuss FAIR data principles applied to machine learning models and how the Data and Learning Hub for Science (DLHub) can help make models more easily discoverable and usable in common scientific workflows. Visit https://www.dlhub.org for more information.
Blackboard Learn Deployment: A Detailed Update of Managed Hosting and SaaS De...Blackboard APAC
Blackboard has deployed cloud solutions for well over a decade and is very excited to launch our new SaaS offering at the Teaching and Learning conference. The session will explore Blackboard’s continued commitment to managed hosting, partnership with IBM/AWS and next generation SaaS offerings that offer institutions unique control over their innovation journey.
Are you planning to move existing applications to the cloud and want to avoid setbacks? These slides are from a webinar jointly presented by Atmosera and iTrellis, LLC. The webinar can help you find out how to assess your needs, plan out a migration and successfully operate your applications in a modern cloud environment. The webinar will provide the following answers:
* What re-platforming means and why you need to think about it
* How to take full advantage of a cloud such as Azure: agility, flexibility, and cost savings
* Lessons learned and best practices for planning a successful move to a modern cloud.
The full webinar playback URL is at https://www.atmosera.com/webinar-replatforming-application-cloud/
Similar to 2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking (20)
Perceptions of Project Managers in the Job Marketplace (and what to do about it)Bruce Kozuma
Given to the PMI Central Mass chapter on 2015-01-13. Can also be downloaded from here: https://pmicmass.org/document-repository/meetings-archive/2015-meetings-archive/381-2015-01-13-perceptions-of-pms-in-the-job-marketplace-bruce-kozuma
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
JMeter webinar - integration with InfluxDB and Grafana
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
1. Building a low-cost sample tracking system
with G Suite & Jira Cloud
What you can do with a little knowledge, a lot of ignorance,
some time, and permission to take a boondoggle
For Bio-IT World
2019/04/17 v1
2. About the Broad Institute of MIT and Harvard
• Propelling the understanding and
treatment of disease
• Collaborating deeply
• Reaching globally
• Empowering scientists
• Building partnerships
• Sharing data and knowledge
• Promoting inclusion
3. Takeaways
• Current off-the-shelf technology allows for a serverless sample tracking solution (backed by a lot of infrastructure)
• Low-cost solutions are a good fit for academic research because of the effects of overhead, and having them removes finding sources of funding as a rate-limiting factor for accelerating science
• Developing the Minimum Viable Product, along with short-cycle/iterative delivery of solutions to users, allows rapid feedback on what works and increases the velocity of science
• Making delivery deadlines on time builds faith that further iterations are worth the investment of the project team’s time/focus
• Permission to invest time into learning a new skill not obviously in line with a job description can move research forward by developing new capabilities to apply to problems
4. A little history…
• 2014: I arrive at the Broad to work on solutions for management of laboratory/scientific data, divided into functions (graphic by Scott Sutherland)
5. A little history
• Turns out the biggest need: Where’s my stuff (i.e., samples, data)?
6. One view of sample tracking at the Broad
• The parable of the blind people and the elephantiformes
7. One view of sample tracking at the Broad

| Sample lifecycle | Activity | Tracking systems |
| --- | --- | --- |
| Before physical samples received | Project launch; Find participants/samples; Ship sample kits to participants | Jira Cloud; Google Sheets/Google Forms; Consent systems; Smartsheet |
| Before processing | Store samples; Process samples prior to sequencing | Bespoke LIMS, COTS lab data management systems; Google Sheets; Jira Cloud |
| During processing (e.g., sequencing) | Sequencing at GP/elsewhere; Analysis by Proteomics | Bespoke LIMS, on-premises Jira |
| After processing | Data analysis; Data transfer | Google Sheets, Jira Cloud; Aspera, Trello |
| After initial use | Compare samples; Reuse samples | ?; Consent systems |
10. Components of a G Suite & Jira Cloud-based
sample tracking system
<name>@broadinstitute.org
<name>@broad.mit.edu
16. Suitable for all?
• Discovery: Can experimental techniques produce data to answer scientific questions?
• Scale Discovery: Scaling experimental techniques so they can more reliably produce data at a high rate
• Data Production: Regularly producing experimental data and producing quality control data
• Iterative Refinement: Refining production-scale processes; some level of change management is expected to ensure the quality of the data produced is maintained or improved
(Slide graphic: a spectrum from early-stage technology development, e.g., PRISM, to platform, e.g., DMX)
17. Projects helped so far…
| Project | Things tracked | Approximate go-live date |
| --- | --- | --- |
| Comparative Medicine | Can’t tell (Issue security!) | 2018/01/01 |
| Firehose to FireCloud Migration | ~2800 | 2018/03/01 |
| Regev Lab (scRSP) | ~2700 | 2018/04/01 |
| Archive of Lines in Artificial Societies | ~500 | 2018/04/15 |
| NeuroGAP-Psychosis Ship Log | ~100 | 2018/05/01 |
| External Compound Request | ~50 | 2018/06/01 |
| Microbial Omics | ~950 | 2018/08/01 |
| Data Map Expansion | | In planning stages |
18. Common factor? Each of these groups is piloting solutions with
rapid iterations, applying Agile techniques to speed science
• Sheila Dodge's Dynamic Work Design paper
• Agile Academia (Broad Affinity Group)
• Kendra West's The Agile Laboratory Handbook
• Kendall Square Agilists & Agile Biotech Boston
19. Development principles
1. Move science forward
2. Usability to encourage people to use it!
3. Low cost (i.e., no Jira Cloud add-ons, no outside labor)
4. Solution sustainable beyond initial development team
5. Deliver solutions to users in short time frames and rapidly iterate
6. Users in control as much as possible for shape of solution (e.g., layout of Google Sheet,
which fields needed, columns in Jira Boards, etc.)
7. Have as little code as necessary/leave as much to other components as possible (e.g.,
VLOOKUP in Google Sheets)
8. Limit dependencies between components where possible (hah)
9. At least attempt to think about security (e.g., limit storage of credentials)
10. Document, document, document…
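Principle 7 above (have as little code as necessary) can often be satisfied with a plain spreadsheet formula rather than script. As a hypothetical sketch (the 'Jira Export' sheet name and ranges are made up), a VLOOKUP that pulls a looked-up value into the tracking sheet might be:

```
=VLOOKUP(A2, 'Jira Export'!A:C, 3, FALSE)
```

This searches for the key in A2 down column A of the 'Jira Export' sheet and returns the matching value from the third column of the range; the FALSE argument forces an exact match.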
20. Why G Suite & Jira Cloud?
G Suite
• Already established at the Broad
• High level of user familiarity/skill already exists
• Cost covered by overhead already*
• Users able to prototype solutions quickly
• Metadata datasets are small
• Adequate feature set, i.e., can persist data,
flexible data types (+/-)
• Can share outside Broad easily
• SaaS/integrated into BITS architecture
• Developer (me) had easily transferable skills/experience
• Lots of resources from which to learn and copy
Jira Cloud
• Already established at the Broad
• Some level of user familiarity/skill already exists
• Cost covered by overhead already
• Developer able to prototype solutions quickly
• Metadata datasets are small
• Adequate feature set, i.e., configurable workflows,
separable workflows by item type, custom fields
• SaaS/integrated into BITS architecture
• Developer (me) had some transferable skills/experience and some good history to follow
• Lots of resources from which to learn and copy
21. Lots of resources to learn how to automate them!
• w3schools.com
• https://www.w3schools.com/js/default.asp
• Atlassian
• https://developer.atlassian.com/cloud/jira/platform/rest/v3/?utm_source=%2Fcloud%2Fjira%2Fplatform%2Frest&utm_medium=302
• https://developer.atlassian.com/server/jira/platform/jira-rest-api-examples/
• Stack Overflow
• https://stackoverflow.com
• Style guides
• https://www.w3schools.com/js/js_conventions.asp
• https://google.github.io/styleguide/jsguide.html
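The Atlassian references above document the Jira Cloud REST API. As a minimal sketch of what automation against it looks like, this helper builds the URL for the v3 search endpoint from a JQL query; the site name ('broad-example') and the JQL itself are hypothetical:

```javascript
// Build the URL for Jira Cloud's GET /rest/api/3/search endpoint.
// JQL must be URL-encoded, which encodeURIComponent handles.
function jiraSearchUrl(site, jql, maxResults) {
  return 'https://' + site + '.atlassian.net/rest/api/3/search' +
    '?jql=' + encodeURIComponent(jql) +
    '&maxResults=' + maxResults;
}

// Example: find open issues in a hypothetical SAMP project.
var url = jiraSearchUrl('broad-example', 'project = SAMP AND status = "To Do"', 50);
```

In Google Apps Script, the resulting URL would be fetched with UrlFetchApp along with an Authorization header; the same string-building logic applies unchanged.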
22. Design considerations
• Put code as close to where it needs to be as possible
• Google Sheet: Code to add menus to Google Sheets
• Google Forms or Google Sheets: Code to call Google Apps Script Module
• Google Apps Script module: Code to do extract/transform/load from Google
Sheet, upload to Jira Cloud, link Issues in Jira Cloud, etc.
• Use Google Forms to design intake forms for collaborators
• Use Google Sheets to store data from Google Forms and data necessary in
Issues in Jira
• Use Google Groups to establish role accounts for G Suite and Jira Cloud
• Multiple Boards in Jira for different views of the same data
• Prefer using each component for what it does best
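The extract/transform/load step described above (Google Sheet rows into Jira Cloud Issues) can be sketched as a pure function. The column positions, project key, and custom field ID below are hypothetical, not the Broad's actual configuration:

```javascript
// Turn one row of a Google Sheet (an array of cell values) into the JSON
// payload that Jira Cloud's "create issue" REST endpoint expects.
function buildJiraIssuePayload(row) {
  var SAMPLE_ID = 0;  // column A: sample identifier (assumed layout)
  var SUMMARY = 1;    // column B: short description
  var COLLAB = 2;     // column C: collaborator name
  return {
    fields: {
      project: { key: 'SAMP' },           // hypothetical Jira Project key
      issuetype: { name: 'Task' },
      summary: row[SAMPLE_ID] + ': ' + row[SUMMARY],
      customfield_10042: row[COLLAB]      // hypothetical custom field ID
    }
  };
}

// Example row as it would come out of a Google Form response sheet.
var payload = buildJiraIssuePayload(['BRD-001', 'Blood draw kit', 'Stevenson']);
```

In the Apps Script module, each payload would then be POSTed to Jira Cloud with UrlFetchApp.fetch; keeping the transform as a plain function like this also makes it easy to test outside of Apps Script.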
23. Keys for success (thus far)
• Not quarterly-driven pharma (time = $$$) so space to learn new things
• (Some) freedom to work on interesting and pressing issues
• Feasible to pick up required knowledge to deliver minimum viable product
• Culture of volunteerism (no one said I couldn’t work on it)
• Supportive environment for learning and applying agile techniques
• Iterative development/focus on minimum viable product (or hang yourself)
24. Not all is copacetic
• Google Apps Script editor is a bit
primitive (I miss colors,
autocompletion)
• Using another editor seemingly requires a lot of futzing that could be better spent fixing bugs/delivering features
• No integration to GitHub
• Must remember to NOT paste
username:password into GitHub :-|
• Calling code in GAS modules is
sloooow
• Google Sheets configuration can be
brittle (must know a priori about
columns and sheets)
• Have to know a lot about Jira Cloud
configuration to connect with it (e.g.,
custom field code ID)
• Our Jira Cloud instance configuration
needs housecleaning
• Versioning to be refined
• Security to be refined
25. To do list
• GAS project setup refinements
• Need to not run in development mode
• Code changes
• Use API tokens per role account
• Checks on user permissions (does the Google Sheets user have access to the Jira Cloud Project?)
• Use Jira Cloud Webhook to call a Google
Cloud Function, then modify the passed
JSON object to call back to Jira Cloud to
extend Jira functionality (e.g., transition to a
new status once all required fields filled out)
• A developer’s guide would be useful
• Setup of role account via Google Group
and some BITS trickery
• Ensure role account given access to the
Jira Cloud Project
• What is the appropriate role in the Jira
Cloud Project/what permissions does it
need
• Add Google Group to BITS’ Google
automation
• What is best done in each component
• Remove username:password from code before posting to GitHub (always!)
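The webhook idea on the to-do list (transition an Issue once all required fields are filled out) could be sketched as below. The required field names and the handler shape are assumptions, not an existing implementation:

```javascript
// Decide, from the JSON body Jira Cloud posts to a webhook, whether all
// required fields on the issue are filled in. Field names are hypothetical.
var REQUIRED_FIELDS = ['summary', 'customfield_10042'];

function shouldTransition(webhookBody) {
  var fields = (webhookBody.issue && webhookBody.issue.fields) || {};
  return REQUIRED_FIELDS.every(function (name) {
    var value = fields[name];
    return value !== undefined && value !== null && value !== '';
  });
}

// A Google Cloud Function handler would wrap this, e.g.:
// exports.jiraWebhook = function (req, res) {
//   if (shouldTransition(req.body)) {
//     // POST back to /rest/api/3/issue/{key}/transitions
//   }
//   res.status(200).end();
// };

var ready = shouldTransition({
  issue: { fields: { summary: 'BRD-001', customfield_10042: 'Stevenson' } }
});
```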
26. Silly things
• let vs var
• No warnings on changing values of const!?!?!
• Style guide, what’s that?
• Changing GAS library names == bad idea
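A minimal modern-JavaScript sketch of the var/let difference noted above: var is function-scoped (the loop variable leaks past the loop), while let is block-scoped. The const surprise was specific to Apps Script's legacy ES5 runtime, which allowed silent reassignment; modern runtimes throw a TypeError instead.

```javascript
// var is hoisted to function scope, so the loop counter survives the loop.
function varLeaks() {
  for (var i = 0; i < 3; i++) { /* ... */ }
  return i; // still in scope here
}

// let is block-scoped, so the counter does not exist after the loop;
// typeof on an undeclared identifier safely yields 'undefined'.
function letIsBlockScoped() {
  for (let j = 0; j < 3; j++) { /* ... */ }
  return typeof j;
}

var leaked = varLeaks();         // 3
var scoped = letIsBlockScoped(); // 'undefined'
```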
27. Even with modest skills, you can still deliver value…
I started with OLD technical skills
• C++, Java, Perl, OLE Automation
• RCS, SourceSafe, ClearCase
• HP-UX, QNX, Red Hat (pre RHEL)
• Client server development
• Excel/Word/Outlook automation via VBA
• ERP, MES, ELN, SDMS, LIMS
• Jira administration
and ended up updating them
• Google Apps Script/JavaScript
• Rudimentary GitHub
• Web development (GET, PUT)
• Debug in cloud
• G Suite, Jira REST API
• (Rudimentary) Google Cloud IAM
authentication with BITS infrastructure
28. Takeaways
• Current off-the-shelf technology allows for a serverless sample tracking solution (backed by a lot of infrastructure)
• Low-cost solutions are a good fit for academic research because of the effects of overhead, and having them removes finding sources of funding as a rate-limiting factor for accelerating science
• Developing the Minimum Viable Product, along with short-cycle/iterative delivery of solutions to users, allows rapid feedback on what works and increases the velocity of science
• Making delivery deadlines on time builds faith that further iterations are worth the investment of the project team’s time/focus
• Permission to invest time into learning a new skill not obviously in line with a job description can move research forward by developing new capabilities to apply to problems
29. Acknowledgements
(even if they’d rather their names not be listed)
• Broad Information Technology
Services (BITS)
• Scientific Computing Services
(SCS) group: Vicky Guo
(manager), Michelle Campo,
Eric Jones, Michael Kirby,
Anthony Losada, Peter Ragone,
Gordon Saksena
• Other BITS people:
Jared Bancroft, Lukas Karlsson,
Bill Mayo, Katie Shakun,
Andrew Teixeira, Elsa Tsao…
• Scientific collaborators
• Thomas Cleland, Danielle Dionne,
Joshua Gould, Zach Leber,
Yenarae Lee, Anna Neumann,
Jenna Pfiffner-Borges,
Anne Stevenson, Kendra West,
Alec Wysoker, Didi Vaz
• Broad alumni
• Sadiya Akasha, Marc Monnar,
Scott Rich
31. Learn the Broad Institute's best practices using the Atlassian tools
(Since it was on the advertisement!)
• Integration with Broad infrastructure (Single Sign On mostly)
• Understand our environment and tailor approach to it
• Flexible and ever changing workforce (groups and personnel), i.e., graduate students, post
docs, Associates, outside collaborators, interns, normal turnover, new groups, refactored
groups…
• A collection of semi-independent entities with a common goal
• Training, training, training
• Jira 101: Aimed at those new to Jira Cloud
• Jira 102: Jira Cloud Board and Project Administration
• Jira 201: Advanced Boards, JQL, Importing Issues, Mass Change (planned)
• Jira 301: Jira Cloud Project Administration (in development)
• Jira 401: Integrating Jira Cloud (planned)
32. Learn the Broad Institute's best practices using the Atlassian tools
(Since it was on the advertisement!)
• Standardish practices/models
• KISS, e.g., start simple (To Do, Doing, Done) and iterate (To Do, Doing, Checked, Done)
• Share as little as possible (what’s not possible to separate, e.g., Statuses, Issue Types, custom fields)
• Separate things as much as possible (e.g., Workflows, Notification Schemes)
• Keep things private by default (e.g., only add people to Projects instead of all users in your instance)
• Use a tool for what it is best suited to do, and not other things
• Sometimes ServiceNow, Trello, or Smartsheet might be better suited to people’s needs
• Deliver value in short increments
• Attempt to follow IT system management/software engineering best practices
• Ticketing system for Jira Cloud Project requests
• Change Requests for major changes
• Test plans for major changes