An overview of the design, technical decisions, and implementation of the Arabidopsis Information Portal community-extensible data sharing and analytics platform.
Data Ingest Self Service and Management using NiFi and Kafka (DataWorks Summit)
We’re feeling the growing pains of maintaining a large data platform. Last year we went from 50 to 150 unique data feeds, adding each one by hand. In this talk we will share the best practices we developed to handle this tripling of feeds through self-service. Self-service capabilities will increase your team’s velocity and decrease your time to value and insight.
* Self-service data feed design and ingest
* Configuration management
* Automatic debugging
* Lightweight data governance
Accelerating query processing with materialized views in Apache Hive (DataWorks Summit)
Over the last few years, the Apache Hive community has been working on advancements that enable a whole new range of use cases for the project, moving from its batch-processing roots toward an interactive SQL query-answering platform. Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the precomputation of relevant summaries, or materialized views.
This talk presents our work on introducing materialized views and automatic query rewriting based on those materializations in Apache Hive. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit exciting new Hive features such as LLAP acceleration. The optimizer relies on Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, joins, and aggregation operations. We describe the current coverage of the rewriting algorithm, how Hive controls important aspects of the life cycle of materialized views such as the freshness of their data, and outline interesting directions for future improvements. We include an experimental evaluation highlighting the benefits that the use of materialized views can bring to the execution of Hive workloads.
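As a rough illustration of the idea behind the talk (plain Python, not Hive; the table, view, and rewrite step are invented stand-ins), a precomputed summary can answer a matching aggregate query without touching the base rows:

```python
# Conceptual sketch: a precomputed summary standing in for a materialized
# view, plus a naive "rewrite" step that answers a matching aggregate query
# from the summary instead of the base data.
from collections import defaultdict

# Base "fact table": (store, amount) rows.
sales = [("a", 10), ("b", 5), ("a", 7), ("c", 3), ("b", 2)]

# Analogous to: CREATE MATERIALIZED VIEW mv AS
#               SELECT store, SUM(amount) FROM sales GROUP BY store
mv = defaultdict(int)
for store, amount in sales:
    mv[store] += amount

def total_for(store):
    # The optimizer's job in a nutshell: the query matches the view
    # definition, so serve it from the precomputation, not the base rows.
    return mv[store]

print(total_for("a"))  # 17
```

Hive's actual rewriting, of course, happens inside the Calcite-based optimizer rather than in user code; this only sketches the precompute-then-rewrite principle.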
Speaker
Jesus Camacho Rodriguez, Member of Technical Staff, Hortonworks
How USCIS Powered a Digital Transition to eProcessing with Kafka (Rob Brown &…) (Confluent)
Last year, U.S. Citizenship and Immigration Services (USCIS) adopted a new strategy to accelerate our transition to a digital business model. This eProcessing strategy connects previously siloed technology systems to provide a complete digital experience that will shorten decision timelines, increase transparency, and more efficiently handle the 8 million requests for immigration benefits the agency receives each year.
To pursue this strategy effectively, we had to rethink and overhaul our IT landscape, one that has much in common with those of other large enterprises in both the public and private sectors. We had to move away from antiquated ETL processes and overnight batch processing. And we needed to move away from the jumble of ESBs, message queues, and spaghetti-stringed direct connections that were used for interservice communication.
Today, eProcessing is powered by real-time event streaming with Apache Kafka and Confluent Platform. We are building out our data mesh with microservices, CDC, and an event-driven architecture. This common core platform has reduced the cognitive load on development teams, who can now spend more time on delivering quality code and new features, and less on DevSecOps and infrastructure activities. As teams have started to align around this platform, a culture of reusability has grown. We’ve seen a reduction in duplication of effort (in some cases by up to 50%) across the organization, from case management to risk and fraud.
Join us at this session, where we will share how we:
* Used skunkworks projects early on to build internal knowledge and set the stage for the eProcessing form factory that would drive the digital transition at USCIS
* Aggregated disparate systems around a common event-streaming platform that enables greater control without stifling innovation
* Ensured compliance with FIPS 140-2 and the other security standards we are bound by
* Developed working agreements that clearly define the type of data a topic contains, including any personally identifiable information requiring additional measures
* Simplified onboarding and restricted jumpbox access with Jenkins jobs that can be used to create topics in dev and other environments
* Implemented distributed tracing across all topics to track payloads throughout our entire domain structure
* Started using KSQL to build streaming apps that extract relevant data from topics, among other use cases
* Supported grassroots efforts to increase use of the platform and foster cross-team communities that collaborate to increase reuse and minimize duplicated effort
* Established a roadmap for federation with other agencies, which includes replacing SOAP, SFTP, and other outdated data-sharing approaches with Kafka event streaming
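The distributed-tracing point above can be sketched in miniature (pure Python; the topic names and event shape are hypothetical, and plain lists stand in for real Kafka topics):

```python
# Hypothetical sketch of the distributed-tracing idea: every payload carries
# a trace id that is propagated as it moves between "topics" (plain lists
# here, not real Kafka), so one request can be followed across the domain.
import uuid

topics = {"intake": [], "adjudication": []}

def publish(topic, payload, trace_id=None):
    # New events get a fresh trace id; downstream republishing reuses it.
    event = {"trace_id": trace_id or str(uuid.uuid4()), "payload": payload}
    topics[topic].append(event)
    return event["trace_id"]

# A downstream service re-publishes under the same trace id.
tid = publish("intake", {"form": "I-90"})
publish("adjudication", {"decision": "pending"}, trace_id=tid)

# Reconstruct the full trace across all topics.
trace = [e for evs in topics.values() for e in evs if e["trace_id"] == tid]
print(len(trace))  # 2
```

In a real deployment the trace id would typically ride in Kafka record headers and be collected by tracing infrastructure, but the propagation pattern is the same.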
Spark Summit EU 2015: Matei Zaharia keynote (Databricks)
2015 was a year of continued growth for Spark, with numerous additions to the core project and very fast growth of use cases across the industry. In this talk, I’ll look back at how the Spark community has grown and changed in 2015, based on a large Apache Spark user survey conducted by Databricks. We see some interesting trends in the diversity of runtime environments (which are increasingly not just Hadoop), the types of applications run on Spark, and the types of users, now that features like R support and DataFrames are available in Spark. I’ll also cover the ongoing work in the upcoming releases of Spark to support new use cases.
Introduction to Apache NiFi, DWS DC 2019 (Timothy Spann)
A quick introduction to Apache NiFi and its ecosystem, plus a hands-on demo covering processors, examining provenance, and ingesting REST feeds, XML, cameras, and files; running TensorFlow and Apache MXNet; integrating with Spark and Kafka; and storing to HDFS, HBase, Phoenix, Hive, and S3.
Building Streaming Data Applications Using Apache Kafka (Slim Baltagi)
Apache Kafka evolved from an enterprise messaging system into a fully distributed streaming data platform for building real-time streaming data pipelines and streaming data applications, without the need for other tools or clusters for data ingestion, storage, and stream processing.
In this talk you will learn more about:
1. A quick introduction to Kafka Core, Kafka Connect, and Kafka Streams: what they are and why they matter
2. Code and step-by-step instructions to build an end-to-end streaming data application using Apache Kafka
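To make the flavor of such an application concrete, here is a broker-free sketch in plain Python of the classic streaming word-count pattern (the state store and changelog are simulated with ordinary data structures, not real Kafka APIs):

```python
# A toy, broker-free sketch of the Kafka Streams "word count" pattern:
# consume records from a stream, update a state store, and emit an update
# to a changelog for every state change.
from collections import Counter

stream = ["hello world", "hello kafka", "kafka streams"]

state = Counter()   # stands in for a Kafka Streams state store
changelog = []      # stands in for the output/changelog topic

for record in stream:
    for word in record.split():
        state[word] += 1
        changelog.append((word, state[word]))

print(state["kafka"])  # 2
```

A real Kafka Streams application would express the same topology with a builder API and get partitioning, fault tolerance, and state-store backups for free; the per-record update logic is what this sketch mirrors.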
Apache Apex brings you the power to quickly build and run big data batch and stream processing applications. But what about visualizing your data in real time as it flows through the Apache Apex applications? Together, we will review Apache Apex, and how it integrates with Apache Hadoop and Apache Kafka to process your big data with streaming computation. Then we will explore the options available to visualize Apex applications metrics and data, including open-source options like REST and PubSub mechanisms in StrAM, as well as features available in the RTS Console like real-time Dashboards and Widgets. We will also look into ways of packaging dashboards inside your Apache Apex applications.
Breathing New Life into Apache Oozie with Apache Ambari Workflow Manager (DataWorks Summit)
Running scheduled, long-running, or repetitive workflows on Hadoop clusters, especially secure clusters, is the domain of Apache Oozie. Oozie, however, suffers from XML-based job configuration and a dated UI, making for poor usability all around. Apache Ambari, in its quest to make cluster management easier, has branched out to offering views for user services. This talk covers the Ambari Workflow Manager view, which provides a GUI to author and visualize Oozie jobs.
As an example of Workflow Manager, Oozie jobs for log management and HBase compactions will be demonstrated, showing how easy Oozie can now be and what the exciting future for Oozie and Workflow Manager holds.
Apache Oozie is the long-time incumbent in big data workflow scheduling. It is known to be hard to use and the interface is not aesthetically pleasing; Oozie suffers from a dated UI. However, for secure Hadoop clusters, Oozie is the most readily available, obvious, and full-featured solution.
Apache Ambari is a deployment and configuration management tool used to deploy Hadoop clusters. Ambari Workflow Manager is a new Ambari view that helps address the usability and UI appeal of Apache Oozie.
In this talk, we’re going to leverage the stable foundation of Apache Oozie and clarity of Workflow Manager to demonstrate how one can build powerful batch workflows on top of Apache Hadoop. We’re also going to cover future roadmap and vision for both Apache Oozie and Workflow Manager. We will finish off with a live demo of Workflow Manager in action.
Speaker
Artem Ervits, Solutions Engineer, Hortonworks
Clay Baenziger, Hadoop Infrastructure, Bloomberg
Interactive Analytics at Scale in Apache Hive Using Druid (DataWorks Summit)
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability, and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution makes it possible to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that benefits both systems, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with an experimental evaluation highlighting the performant and powerful integration of these projects.
Speaker
Jesus Camacho Rodriguez, Hortonworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks (Slim Baltagi)
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland, on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose big data analytics framework and a real-world streaming analytics framework. It focuses on Flink’s key differentiators and its suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying state by behaving like a key-value data store.
Time Series Analysis Using an Event Streaming Platform (Dr. Mirko Kämpf)
Advanced time series analysis (TSA) requires specialized data preparation procedures to convert raw data into useful, compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and safety protection.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation we will look at typical data aggregation patterns and investigate how to apply analysis algorithms in the cloud. Finally, we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent Cloud.
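A minimal, self-contained sketch of the correlation-network reconstruction mentioned above (standard-library Python only; the example series and the 0.9 threshold are made up for illustration):

```python
# Reconstruct a correlation network: compute pairwise Pearson correlation
# between time series and keep an edge wherever |r| passes a threshold.
from math import sqrt
from itertools import combinations

def pearson(x, y):
    # Plain Pearson correlation coefficient for two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

series = {
    "s1": [1, 2, 3, 4, 5],
    "s2": [2, 4, 6, 8, 10],   # perfectly correlated with s1
    "s3": [5, 1, 4, 2, 3],    # only weakly related
}

# Edges of the reconstructed network: pairs with strong correlation.
edges = [(a, b) for a, b in combinations(series, 2)
         if abs(pearson(series[a], series[b])) >= 0.9]
print(edges)  # [('s1', 's2')]
```

At scale this pairwise computation would run on the streaming platform rather than in memory, but the graph-construction step is the same.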
Apache Flink 1.0: A New Era for Real-World Streaming Analytics (Slim Baltagi)
These are the slides of my talk at the Chicago Apache Flink Meetup on April 19, 2016. This talk explains how Apache Flink 1.0, announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of real-time and real-world streaming analytics. The talk maps Flink’s capabilities to streaming analytics use cases.
In this session we'll look at a number of different organisations that are on their big data cybersecurity journey with Apache Metron. We'll examine the use cases they are investigating, the data sources they used, the analytics they performed, and in some cases the results they were able to find.
We'll also spend some time talking about the common themes in these projects, including common approaches to introducing Apache Metron as a phased project. We'll review some of the common pitfalls and give some concrete suggestions about the things you should (and shouldn't) do when you're getting started.
Finally, we'll tackle some of the key FAQs that come up when people first investigate using Apache Metron in the real world, based on over a year of interacting with customers and prospects as they look deeper into Apache Metron to see how it fits into their cybersecurity portfolio.
Speaker
Dave Russell, Principal Solutions Engineer, Hortonworks
Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi… (Confluent)
NASA's Deep Space Network (DSN) operates spacecraft communication links for NASA deep-space spacecraft missions, including the Curiosity Rover, the Voyager twin spacecraft, Galileo, New Horizons, and others, and has done so reliably for over fifty years. The DSN Complex Event Processing (DCEP) software assembly is a new software system being deployed worldwide into NASA's DSN Deep Space Communication Complexes (DSCCs), including facilities in Spain, Australia, and the United States. The system brings into the DSN next-generation "Big Data" and "Fast Data" infrastructural tools, including Apache Kafka, for correlating real-time network data with other critical data assets, including predicted antenna pointing parameters and extensive logging of physical hardware in the DSN. The ultimate use case is to ingest, filter, store, and visualize all of the DSN's monitor and control data and to actively ensure the successful DSN tracking, ranging, and communication integrity of dozens of concurrent deep-space missions. The system is also intended to support future autonomy applications, including automated anomaly detection in real-time network monitor streams and automated reconfiguration of antenna-related assets as needed by future, increasingly autonomous spacecraft. This talk will focus on the software system behind DCEP, and introduce novel approaches to increasing NASA spacecraft link-control operator cognizance of anomalies that may and do occur during spacecraft tracking activities. This talk will also offer lessons learned, and provide a glimpse into one of the most unique, "out-of-this-world" applications of Apache Kafka.
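As a hedged illustration of the kind of automated anomaly detection the abstract mentions (this is not NASA's DCEP code; the window size, threshold, and sample readings are invented), a rolling-statistics detector over a monitor stream might look like:

```python
# Flag monitor samples that deviate from a rolling mean by more than
# k standard deviations of the preceding window.
from collections import deque
from math import sqrt

def detect(stream, window=5, k=3.0):
    buf, anomalies = deque(maxlen=window), []
    for i, x in enumerate(stream):
        if len(buf) == buf.maxlen:
            mean = sum(buf) / len(buf)
            var = sum((v - mean) ** 2 for v in buf) / len(buf)
            if var > 0 and abs(x - mean) > k * sqrt(var):
                anomalies.append(i)   # record index of anomalous sample
        buf.append(x)
    return anomalies

# A hypothetical telemetry channel with one obvious spike.
readings = [10.0, 10.1, 9.9, 10.0, 10.2, 10.1, 55.0, 10.0]
print(detect(readings))  # [6]
```

A production stream processor would run this per-channel over Kafka topics and feed alerts to operators; the rolling-window statistic is the core of the technique.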
Innovation in the Enterprise Rent-A-Car Data Warehouse (DataWorks Summit)
Big Data adoption is a journey. Depending on the business the process can take weeks, months, or even years. With any transformative technology the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.
This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental, and Alamo Rent A Car brands) and its experience with Big Data. EHI’s journey started in 2013 with Hadoop as a POC, and today it is working to create the next-generation data warehouse in Microsoft’s Azure cloud utilizing a lambda architecture.
We’ll discuss the Center of Excellence, the roles in the new world, share the things which worked well, and rant about those which didn’t.
No deep Hadoop knowledge is necessary; the talk is aimed at the architect or executive level.
Scaling People, Not Just Systems, to Take On Big Data Challenges (Matthew Vaughn)
Here, I describe how the Texas Advanced Computing Center has shifted its focus from traditional modeling and simulation towards fully embracing big data analytics performed by users with diverse technical backgrounds.
How Cyverse.org enables scalable data discoverability and re-use (Matthew Vaughn)
Cyverse.org designs, builds, and operates an innovative, integrated life sciences cyberinfrastructure. It provides data management and analysis capabilities with point and click, cloud, API, and command-line interfaces that engage users of any computing proficiency and is based on an extensible platform that integrates local and national-scale HPC, storage, and cloud resources. Cyverse directly supports thousands of users who store and access over 2PB of research data, use millions of compute hours annually, and participate in the platform's improvement, plus a secondary user community from partner projects that have built atop it. Cyverse is organized around "Data Store" and "App Catalog" services, each of which enables users to upload digital research assets that can be kept private, shared, or made public. Recently, Cyverse has been transitioning from passively enabling digital sharing towards active facilitation. It is partnering with repositories like NCBI SRA to enable direct submission from Cyverse applications, adopting commonly-used ontologies, enabling import/export of virtual machine images, developing metadata-driven persistent landing pages for data sets, and providing DOI (and other identifier) services. These new features are expected to further catalyze growth of an interoperable, interconnected network of shared research infrastructure across the biological sciences.
E-marketing study: Mobile email (maelle urban)
How can email marketing be adapted to mobile devices in order to maintain and strengthen the customer relationship?
With the spread of phones that can connect to the mobile Internet, it is becoming essential for companies to adapt their strategy to this new channel. Indeed, the mobile Internet lets users check their email, the primary reason for connecting via a mobile phone. Yet there is a multitude of mobile devices, all with different technological characteristics, and reading email on a mobile phone can therefore be laborious. An unread email is a potentially lost customer, so email must adapt quickly to this new channel.
#Bornsocial: Social media usage by under-13s (heaven)
Will the under-13 generation upend marketing on social media?
On September 27, 2016, a conference organized by the digital communication agency Heaven Conseil addressed the question of preteens on social networks and their influence on the marketing of tomorrow.
This under-13 population is particularly interesting because it is not supposed to be able to sign up for these platforms.
Most of the figures presented are exclusive and were obtained in partnership with the association Génération-Numérique.
ICAR 2015
Workshop 10 (TUESDAY, JULY 7, 2015, 4:30-6:00 PM)
The Arabidopsis information portal for users and developers
Matt Vaughn (Texas Advanced Computing Center)
Developing Apps: Exposing your data through Araport
Arabidopsis Information Portal, Developer Workshop 2014, IntroductionJasonRafeMiller
The Arabidopsis Information Portal (araport.org) is a resource for the plant genomics research community. The AIP conducts developer workshops to help other labs get involved. This presentation introduces the web site with a case study about contributing a new module built around a legacy data set.
Cask Webinar
Date: 08/10/2016
Link to video recording: https://www.youtube.com/watch?v=XUkANr9iag0
In this webinar, Nitin Motgi, CTO of Cask, walks through the new capabilities of CDAP 3.5 and explains how your organization can benefit.
Some of the highlights include:
- Enterprise-grade security - Authentication, authorization, secure keystore for storing configurations. Plus integration with Apache Sentry and Apache Ranger.
- Preview mode - Ability to preview and debug data pipelines before deploying them.
- Joins in Cask Hydrator - Capabilities to join multiple data sources in data pipelines
- Real-time pipelines with Spark Streaming - Drag & drop real-time pipelines using Spark Streaming.
- Data usage analytics - Ability to report application usage of data sets.
- And much more!
Arabidopsis Information Portal: A Community-Extensible Platform for Open DataMatthew Vaughn
Araport is an innovative model organism database resource that offers users the ability to bring their own visualizations, data sets, algorithms, and genome browser tracks and share them with their colleagues.
IoT Physical Servers and Cloud Offerings.pdfGVNSK Sravya
Introduction to Cloud Storage models
• Communication APIs
• Webserver-Web server for IoT
• Cloud for IoT
• Python web application framework
• Designing a RESTful web API.
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...apidays
apidays LIVE Hong Kong 2021 - API Ecosystem & Data Interchange
August 25 & 26, 2021
Multi-Protocol APIs at Scale in Adidas
Jesus de Diego, API Evangelist at Adidas
Interoperability in the Internet of Things is critical for emerging services and applications. In this presentation we advocate the use of IoT ‘hubs’ to aggregate things using web protocols, and suggest a staged approach to interoperability. In the context of a UK government funded project involving 8 IoT projects to address cross-domain IoT interoperability, we introduce the HyperCat IoT catalogue specification. We then describe the tools and techniques we developed to adapt an existing data portal and IoT platform to this specification, and provide an IoT hub focused on the highways industry called ‘Smart Streets’. Based on our experience developing this large scale IoT hub, we outline lessons learned which we hope will contribute to ongoing efforts to create an interoperable global IoT ecosystem.
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...confluent
PayPal currently processes tens of billions of signals per day from different sources in batch and streaming mode. The data processing platform is the one powering these different analytical needs and use cases, not just at PayPal but at adjacencies like Venmo, Hyperwallet, and iZettle. End users of this platform demand access to data insights with as much flexibility as possible, with low processing latency.
One such use case is our Switchboard (data de-multiplexer) platform, where we process approximately 20 billion events daily and provide data to different teams and platforms within PayPal, as well as to platforms outside PayPal, for more insights. When we started building this platform, Kafka was just another asynchronous message-processing platform for us, but we have seen it evolve to a point where it adds value not just in terms of event processing but also for platform resiliency and scalability.
Takeaway for the audience: most people work with and have knowledge about data. With this talk I want to present information which is relevant and meaningful to the audience: information and examples which will make it easier for attendees to understand our complex system and hopefully provide some practical takeaways for using Kafka for similar problems at hand.
Tripal within the Arabidopsis Information Portal - PAG XXIIIVivek Krishnakumar
Araport plans to implement a Chado-backed data warehouse, fronted by Tripal, serving as our core database, used to track multiple versions of genome annotation (TAIR10, Araport11, etc.), evidentiary data (used by our annotation update pipeline), metadata such as publications collated from multiple sources like TAIR, NCBI PubMed, and UniProtKB (curated and unreviewed), and stock/germplasm data linked to AGI loci via their associated polymorphisms.
On-Demand Cloud Computing for Life Sciences Research and EducationMatthew Vaughn
The Jetstream cloud is a collaboration between CyVerse partners TACC and University of Arizona, University of Chicago, Johns Hopkins University, and Indiana University to bring the flexibility and ease-of-use of CyVerse Atmosphere to the entire community of science, at a much larger scale. Jetstream is a cloud resource operated as part of XSEDE, and built from two independent OpenStack clusters, each capable of supporting thousands of virtual machines and data volumes. The clusters are integrated via the user-friendly "Atmosphere" interface developed by CyVerse, with authentication enabled by Globus, and, unlike the CyVerse cloud, also offer full access to OpenStack web service APIs. Jetstream features a diverse catalog of virtual machine templates. One can launch a personal Galaxy server, do advanced biostatistics, use Matlab, or experiment with new technologies like Docker, all on Jetstream. This talk highlights the unique capabilities of Jetstream and provides information about how researchers from all over can access it.
Clouds, Clusters, and Containers: Tools for responsible, collaborative computingMatthew Vaughn
Intro slides from AKES workshop at ISMB2016. This workshop addresses the challenges and requirements for working effectively on cloud computing and high performance computing resources, discusses the key principles that should guide responsible scientific computation and collaboration, and using hands-on sessions presents practical solutions using emergent software tools that are becoming widely adopted in the global scientific community. Specifically, we will look at using “containers” to bundle software applications and their full execution environment in a portable way. We will look at managing and sharing data across distributed resources. And finally, we will tackle how to orchestrate job execution across systems and capture metadata on the results (and the process) so that parameters and methodologies are not lost.
Packaging computational biology tools for broad distribution and ease-of-reuseMatthew Vaughn
A typical instance of computational biology software is composed of interpreted code, compiled binaries, shared libraries, and shell scripts, sometimes mixed in with use of web services or databases, running in the context of a complex computer operating system, atop increasingly sophisticated physical resources. How can we expect computations to be sharable and reproducible, and how can we hope to train people to use such resources? This talk will describe how the Texas Advanced Computing Center enables distribution and use of scientific software via various approaches, including Jupyter notebooks, GitHub repositories, computation-oriented web service APIs, virtual machine images, and container technologies such as Docker, and how these approaches complement one another for training and education.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and '70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation, makes them the most convenient, least labor-intensive live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. 
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Toxic effects of heavy metals : Lead and Arsenicsanjana502982
Heavy metals are naturally occurring metallic chemical elements that have relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g., arsenic, lead, mercury, cadmium, thallium, chromium, etc.
BREEDING METHODS FOR DISEASE RESISTANCE.pptxRASHMI M G
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io's surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io's trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io's surface using adaptive optics at visible wavelengths.
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With an increasing population, people need to rely on packaged foodstuffs. Packaging of food materials requires the preservation of food. There are various methods for treating food to preserve it, and irradiation treatment is one of them. It is the most common and most harmless method of food preservation, as it does not alter the necessary micronutrients of food materials. Although irradiated food does not cause any harm to human health, quality assessment of food is still required to provide consumers with the necessary information about it. ESR spectroscopy is the most sophisticated way to investigate the quality of food and the free radicals induced during its processing. The ESR spin-trapping technique is useful for the detection of highly unstable radicals in food. The antioxidant capability of liquid food and beverages is mainly assessed by the spin-trapping technique.
Arabidopsis Information Portal overview from Plant Biology Europe 2014
1. araport.org
Arabidopsis Information Portal: A
new approach to data sharing and
cooperative development
Matt Vaughn
Director, Life Sciences Computing
Texas Advanced Computing Center
2. araport.org
Overview
• Rationale for the AIP
• Strategic objectives
• Current state of the platform
• Data federation architecture
• Immediate future plans
• How you can participate
3. araport.org
The Rationale for AIP
• Loss of TAIR as a publicly funded shared resource for data mining and basic bioinformatics
• Centralization as a key contributing factor
– Loading of new data into database
– Development of new user experience
– Curation and annotation
– Community support mission
• AIP is designed to be de-centralized
6. araport.org
The AIP Strategy (1)
• Objectives
– Develop a community web resource
• Sustainably fundable and community-extensible
• Hosts diverse analysis & visualization tools + user data spaces
– Support federation to integrate diverse data sets from distributed data sources
– Maintain the Col-0 gold standard annotation
• Methods
– Assimilate TAIR10 data
– Host an Arabidopsis InterMine
– Develop a strategy to allow federation
– Offer and consume well-designed RESTful web services
– Interoperate with iPlant (and other projects) wherever possible
7. araport.org
The AIP Strategy (2)
• Key Design Decisions
– Centralized (but powerful) data warehousing capability PLUS infrastructure enabling data federation
– JBrowse as a genome browser platform
– WebApollo + Tripal for community annotation
– App store model for graphical data interfaces (complete with 3rd-party developer path)
– Data store model for data sources
– Accessible languages and frameworks
– Secure & modern single sign-on
– Web service access to Arabidopsis data for powerful bioinformatics
– Geo-replication and high availability
– Code re-use from other projects wherever possible
– Full code release in real time via GitHub
8. araport.org
Araport Bill of Materials
• AIP is currently built using
– InterMine*
– JBrowse 1.11.3*
– Drupal 7.25*
• Developer-oriented content management system
– Angular.js, Bootstrap.js, and other web toolkits
– Agave Software-as-a-Service platform
• Developed by the iPlant Collaborative
• Bulk data, metadata, authentication, HPC app & job management, notifications & events, and more
• OAuth2 single sign-on
– Internally-developed API manager
*With extensive customization
12. araport.org
ThaleMine
Why InterMine?
① There aren't a lot of real Arabidopsis web services
② InterMine is a scalable, extensible data warehouse
③ InterMine offers a rich, extensible web application
④ InterMine offers high-quality REST APIs
⑤ InterMine is used by other MODs
ThaleMine is an Arabidopsis-specific deployment of InterMine
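The REST APIs mentioned above follow InterMine's web-service conventions. As a minimal sketch, this is roughly what building a ThaleMine query URL could look like; the base URL, endpoint path, and model paths here are assumptions for illustration, not a documented Araport contract.

```python
from urllib.parse import urlencode

# Assumed ThaleMine service root; InterMine instances conventionally
# expose their web services under a "/service" path.
BASE = "https://apps.araport.org/thalemine/service"

def gene_query_url(symbol):
    """Build an InterMine-style path-query URL returning JSON results.
    The XML query format (model, view, constraint) follows InterMine's
    general web-service conventions."""
    query_xml = (
        f'<query model="genomic" view="Gene.primaryIdentifier Gene.symbol">'
        f'<constraint path="Gene.symbol" op="=" value="{symbol}"/>'
        f'</query>'
    )
    return BASE + "/query/results?" + urlencode(
        {"query": query_xml, "format": "json"})

print(gene_query_url("FLC"))
```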
22. araport.org
What is a Science App?
– Written in HTML/CSS/JavaScript using standard frameworks
– Presented via web browser
• Query or analyze, present, persist
– Developed by AIP and/or the community
• Deployed in AIP "app store"
• Choose which ones you want installed in your Araport "dashboard"
– Uses AIP Data Architecture
• Data services: local and remote query/retrieval
• Data integration and aggregation services
• Computation services
23. araport.org
Araport Architecture
[Architecture diagram] The Agave Enterprise Service Bus connects CLI clients, scripts, and 3rd-party applications to physical resources (HPC, files, databases) through the Agave services: apps, meta, files, profile, jobs, and systems. The Araport API Manager (enroll/manage) fronts AIP and 3rd-party data providers through API mediators (simple proxy, mediator, aggregator, filter), which provide single sign-on, throttling, unified logging, API versioning, and automatic HTTPS. Backend providers may speak REST, REST-like protocols, SOAP, POX, or CGI.
24. araport.org
Data API Design Details (1)
• 100% RESTful services
• Queries are JSON objects (conforming to a JSON schema)
• To enroll a new service in the API Manager
– Specify the mapping between AIP query fields and your service
– Map common query terms to a minimal controlled vocabulary
– Describe all service-specific parameters
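The enrollment steps above can be sketched in a few lines. Everything here is illustrative: the query fields, the controlled-vocabulary terms, and the remote parameter names are hypothetical, standing in for whatever a real enrollment would declare.

```python
import json

# Hypothetical AIP-style JSON query: common terms use a minimal
# controlled vocabulary; anything else is service-specific.
query = {
    "locus": "AT3G52430",      # common controlled-vocabulary term
    "featuretype": "SNP",      # service-specific parameter
}

# Hypothetical mapping from AIP query fields to the remote service's
# own parameter names, declared once at enrollment time.
field_map = {
    "locus": "agi_id",
    "featuretype": "variant_class",
}

def to_remote_params(query, field_map):
    """Rename AIP query fields to the remote service's names;
    unmapped fields pass through unchanged."""
    return {field_map.get(k, k): v for k, v in query.items()}

print(json.dumps(to_remote_params(query, field_map)))
```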
25. araport.org
Data API Details (2)
When field mapping isn't enough:
• Code-based transformations can be specified via
– Python
– Java
– Ruby
– JavaScript
• In technical terms, this is known as MEDIATION
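A mediator of this kind might look like the following sketch in Python: a code-based transformation that turns a legacy plain-text CSV response into JSON-ready records. The function name and sample payload are invented for illustration.

```python
import csv
import io
import json

def mediate_csv_to_json(csv_text):
    """Hypothetical mediator: convert a legacy CSV response (with a
    header row of named columns) into a list of JSON-ready dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

# Example legacy response from a plaintext CGI service (illustrative).
legacy_response = "chromosome,position,allele\nChr1,1203,A\nChr1,1456,T\n"
print(json.dumps(mediate_csv_to_json(legacy_response)))
```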
26. araport.org
Data API Details (3)
• Results returned in a standard Agave JSON format*
– status, message, result
• Result is an array of JSON objects
• These conform to specific schemas
– drafts on AIP GitHub soon for comment
*Unless there's an operational reason not to
27. araport.org
Data API Details (4)
• All Data APIs will implement:
– Count: How many records found?
– Pagination: Return only subsets
– Help: Return a usage page
– Convert: JSON (native), XML, CSV, etc.
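The count and pagination contract every Data API implements can be sketched as follows; the parameter names (`offset`, `limit`) and the response keys are assumptions, since the slides name the behaviors but not the wire format:

```python
def paginate(records, offset=0, limit=25):
    """Sketch of the Count + Pagination behaviors: report the total
    number of records found and return only the requested subset."""
    return {
        "count": len(records),                     # how many records found
        "result": records[offset:offset + limit],  # requested page
    }

page = paginate(list(range(100)), offset=10, limit=5)
print(page)
```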
28. araport.org
Araport Data Federation Architecture
• Docker.io for packaging
• Ultra-portable dev environment
• Wide language support
• Implicit security model
• Scales horizontally for performance
• Data API is a package of metadata + a Dockerfile registered with a central arbiter service
• Also used for services written natively for AIP
Objectives: facile development by end users; simple, secure deployment to AIP systems; reasonable performance
AGAVE API MANAGER
https://github.com/waltermoreira/apim
30. araport.org
End result: Araport Data API Store
curl -X GET -k -v -L -b cookies \
  "https://api.araport.org/store/site/blocks/api/listing/ajax/list.jag?action=getAllPublishedAPIs"

{
  "apis": [
    {"name": "InteractionBrowser",
     "provider": "vaughn",
     "version": "pr2-0.1",
     "context": "/data/BioAnalyticResource/interactionBrowser",
     "status": "Deployed",
     "thumbnailurl": "images/api-default.png",
     "visibility": null,
     "visibleRoles": null,
     "description": "InteractionBrowser",
     "apiOwner": "vaughn",
     "isAdvertiseOnly": false},
31. araport.org
Plans for next 3-6 months
• SNP data
• Epigenomic data via CoGe
• RNA-seq for expression and structural annotation
• AraCyc
• Co-expression data
• Orthologs, trees, alignments
• Various genomes & data sets
• Community annotation using Web Apollo and Tripal
• Interactions
• Developer support & training
32. araport.org
Feature | AIP | TAIR
GBrowse with TAIR10 data | Yes | Yes
JBrowse with TAIR10 data | Yes (also embedded in gene-info page) | No
Epigenomic tracks from EPIC | Yes | No
Affymetrix expression data | Yes (from BAR); embedded in gene-info pages | Some, but not searchable by locus
Protein interaction data | Yes (from BAR; expansion planned) | Similar data set; view through N-Browse
Gene-info/Locus-detail page (by data type): | |
  gene sequence | Yes | Yes
  CDS | Yes | Yes
  GO annotation | Yes | Yes
  PO and PATO | Legacy data (8/31/13) | Yes (8/31/13; some updates)
  Curator summary | Yes (TAIR; 8/31/13) | Yes (8/31/13; some updates)
  Computational description | Yes (TAIR; 8/31/13) | Yes (8/31/13; some updates)
  Literature | Yes; TAIR legacy, UniProt, and NCBI | Yes; NCBI + some manual curation
Flexible query interface | Yes | No
Paywall | No | Yes
BLAST services | Soon | Yes
Web services | Yes | No
Data downloads | Yes | Yes
Links to stock centers | In progress | Yes
1001 genomes SNP data | In progress | No
RNA-seq expression data | Soon | No
Updates to Col-0 sequence and annotation | Yes, from AIP |
Relationship between AIP and TAIR
As conceived and funded, AIP's mission was to be a replacement for TAIR, emphasizing computational over human curation and integrating a wider range of data types through federation. With the rebirth of TAIR through a subscription mechanism, the roles of the two data centers in the Arabidopsis data marketplace have become an evolving matter. TAIR will continue its enrichment of Col-0 annotation through literature curation, etc. AIP will continue to aggregate and integrate data through a combination of federation via web services and assimilation.
33. araport.org
Getting Involved with AIP
• User workshop at upcoming ICAR
• Formal developer engagement begins soon
– Developer discussion at the ICAR meeting in conjunction with the Araport alpha release
– SDK and tutorials available thereafter
– 2-day dev workshop in Austin in Fall* 2014
• For now, send email to araport@jcvi.org describing what you'd like to do
– We'll reach out to you to discuss feasibility and timelines via video conference
34. araport.org
Summary
• Next-generation MOD allowing community participation in its development
• Powerful interactive query and analysis functions available today
• Developing a data federation model
• New data sets and functions coming at a quick pace
• Be on the lookout for participation opportunities
35. araport.org
Chris Town, PI
Lisa McDonald, Education and Outreach Coordinator
Chris Nelson, Project Manager
Jason Miller, Co-PI, JCVI Technical Lead
Erik Ferlanti, Software Engineer
Vivek Krishnakumar, Bioinformatics Engineer
Svetlana Karamycheva, Bioinformatics Engineer
Eva Huala, Project Lead, TAIR
Bob Muller, Technical Lead, TAIR
Gos Micklem, Co-PI
Sergio Contrino, Software Engineer
Matt Vaughn, Co-PI
Steve Mock, Portal Engineer
Rion Dooley, API Engineer
Matt Hanlon, Portal Engineer
Maria Kim, Bioinformatics Engineer
Ben Rosen, Bioinformatics Analyst
Joe Stubbs, API Engineer
Walter Moreira, API Engineer
38. araport.org
Araport Architecture (2): API Manager + Enterprise Service Bus
[Architecture diagram] Legacy API A, legacy API B, and REST API C are exposed to consumer applications as secure, rationalized REST services. In between sit the mediators: simple proxies, a cache, XML-to-JSON, SOAP-to-REST, and CGI-to-REST translators, and a throttle. ThaleMine, data integration, and other services ride the same bus. The layer provides:
• Single sign-on
• Throttling
• Unified logging
• API versioning
• Mediation and translation
• Dev-friendly interfaces
• Rationalized REST for consumer apps
39. araport.org
Science Objectives
• Make more, varied data available to the Arabidopsis (and other) communities within a unified user experience
• Enhance the innate value of data by offering enhanced search, retrieval, and display capabilities
• Facilitate analysis of user data
• Enable community participation in functional annotation
40. araport.org
Technical Objectives
• Deploy a responsive, flexible, community-extensible system
• Provide APIs everywhere!
• Promote and facilitate data integration
• Enable language- and region-specific presentation of scientific content
• Meet mobile computing on its own terms
41. araport.org
Local vs. Data-driven Apps
• Local apps (e.g., Photoshop Express): resources are local and inherently offline, operating on local data using local computing.
• Data-driven apps (e.g., KAYAK Pro): resources are cloud-based and inherently online, with multiple data streams integrated, queried, and presented in the context of a broader objective.
42. araport.org
Araport Bill of Materials
• Araport is currently built using
– Drupal 7.25
• Developer-oriented content management system
– Bootstrap.js and some other JavaScript toolkits
– InterMine (with modifications)
– Bioinformatics infrastructure + misc. other bits
– Agave 2.0 Software-as-a-Service platform
• Developed by the iPlant Collaborative project
• Bulk data, metadata, authentication, HPC app and job management, notifications & events, and more
• OAuth2 out of the box
• Enterprise service bus (ESB) architecture
• http://agaveapi.co/
43. araport.org
Araport APIM Architecture (1)
[Architecture diagram] A JSON query enters the Araport API Manager (which also handles enrollment and management of services) and is routed to per-service pipelines. Each pipeline applies an input key map and input transform before sending the request to the remote service, then applies an output key map and output transform to produce the JSON response. Remote services include the POLYMORPH CGI (form input, CSV output) and REST services such as SNP-by-locus and indel-by-position. The Agave WSO2 interface fronts the pipelines, with a cache (technology TBD) and ElasticSearch alongside for performance and search.
44. araport.org
Araport Architecture: Use Cases (1)
• 1001 Genomes POLYMORPH tools
– Provides variation data via locus or positional search
– Total of seven variant types available for search
– Search parameterization depends a lot on variant type
– Example of a plain-text CGI service
– Returns results as CSV with named columns
• Objective: transform into a RESTful API that expects and returns rationalized JSON
http://polymorph.weigelworld.org
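The CSV-to-rationalized-JSON step for a service like this can be sketched as below. Both the CSV column names and the controlled-vocabulary keys in `COLUMN_MAP` are invented for illustration; POLYMORPH's real column names may differ.

```python
import csv
import io

# Hypothetical mapping from a service's named CSV columns to AIP's
# minimal controlled vocabulary (both sides illustrative).
COLUMN_MAP = {"Chromosome": "chromosome",
              "Position": "position",
              "Type": "variant_type"}

def rationalize(csv_text):
    """Turn a CSV-with-named-columns response into the rationalized
    JSON array a RESTful Data API would return."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{COLUMN_MAP.get(k, k): v for k, v in row.items()}
            for row in reader]

print(rationalize("Chromosome,Position,Type\nChr1,100,SNP\n"))
```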
45. araport.org
Araport Architecture: Use Cases (2)
• ThaleMine
– Has native REST interface for general queries
– Has templates which can form the basis of specific services
• Objective: offer both InterMine-native and AIP-conformant interfaces as Data APIs
• Current path
– Enroll native services in our APIM
– Develop template-based AIP-conformant services
46. araport.org
Data APIs: Getting Started
Service | Queries | Notes
BAR eFP | Locus |
BAR Expressologs | Locus |
BAR Interactions | Locus |
CoGe | Position | Special case: output transform only
NASC $SERVICE | Locus | SOAP-based, but may be offline permanently
OrthologFinder | Locus | Based on a ThaleMine template
POLYMORPH | Locus, Position | Actually seven CGI services
SUBA3 | Locus |
We are compiling example queries, parameter mappings and descriptions, and ideal results for use in implementing the system.
47. araport.org
Developing a Data API
• In order of preference, we'd like you to have ready:
• Well-documented REST
• Moderately well-documented REST
• SOAP services (plus WSDL or WADL)
• Plain Old XML
• Plaintext CGI
• HTML CGI
• No web services at all
• Work with us to enroll your services as a data source. This will involve a minor amount of coding.
48. araport.org
Computational App Model (1)
[Architecture diagram] Containers run via Docker.io on a CentOS 6.4 host OS with a custom repo. Each container mounts host file systems such as /scratch and /database, backed by the host FS (250 GB) and TACC Corral (PB+, reachable via sftp). Agave's apps, data, and jobs services drive the containers through a REST API exchanging JSON objects.
49. araport.org
Science Apps: Grid View
• Current scheme
– 2-3 column view with draggable apps
– Apps are normal, full-size, or collapsed
– Single app screen
• Later in 2014
– N x X grid scheme implementing resizable app "tiles" like one sees in Android or Win8.x
– App SDK libraries will have "help" for enabling resizable design
– Multiple app screens
50. araport.org
Data API Details (2)
• For service-specific parameters
– Provide human-readable names mapped to the original parameter names
– Offer minimal descriptive text
– Specify validation
• Cardinality
• Pattern validator (regex)
• Type (number, string, etc.)
– Indicate whether required
– Indicate whether they should be visible in a UI
– Specify reasonable default values
• Seems familiar?
– This approach is also used to abstract command-line apps
– Allows automatic generation of a minimally functional UI
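A parameter description along these lines might look like the following sketch; the key names and the AGI-locus regex are assumptions for illustration, not the actual AIP schema:

```python
import re

# Hypothetical parameter description in the spirit of the slide:
# human-readable name, description, validation, visibility, default.
PARAM = {
    "id": "locus",                      # original service parameter name
    "label": "Gene locus",              # human-readable name
    "description": "AGI locus identifier, e.g. AT1G01010",
    "type": "string",
    "required": True,
    "visible": True,                    # show in a generated UI
    "max_cardinality": 1,
    "validator": r"^AT[1-5CM]G\d{5}$",  # pattern validator (regex)
    "default": None,
}


def validate(param, value):
    """Check a submitted value against a parameter description."""
    if value is None:
        return not param["required"]
    if param["type"] == "string" and not isinstance(value, str):
        return False
    pattern = param.get("validator")
    return pattern is None or re.fullmatch(pattern, value) is not None
```

Because the description carries labels, types, and visibility flags, a generic form builder can render a minimally functional UI from it without service-specific code.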
51. araport.org
Data APIs: Response types (1)
• locus_relationship – pairwise relationship between A and B
– Directionality
– Type
– Array of scores (weights, etc.)
• sequence_feature – positional attribute
– Extension of the GFF model, plus:
– Build
– Attributes array
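To make the two shapes concrete, here is a sketch of what instances might look like; the field names are hypothetical and the AIP schemas may differ:

```python
# Illustrative locus_relationship: a typed, optionally directed pair
# of loci with an array of scores.
locus_relationship = {
    "locus_a": "AT1G01010",
    "locus_b": "AT4G01550",
    "direction": "undirected",          # directionality
    "relationship_type": "expressolog",
    "scores": [{"name": "weight", "value": 0.87}],
}

# Illustrative sequence_feature: the GFF columns plus a build field
# and a structured attributes array.
sequence_feature = {
    "seqid": "Chr1",
    "source": "TAIR10",
    "type": "gene",
    "start": 3631,
    "end": 5899,
    "strand": "+",
    "build": "TAIR10",                  # assembly the coordinates refer to
    "attributes": [{"key": "ID", "value": "AT1G01010"}],
}
```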
52. araport.org
Data APIs: Response types (2)
• locus_feature – key-value attributes per locus
– Optional controlled vocabulary* for keys
– Support for both slots and arrays
• raw – for returning images or other binary formats
– Source and other metadata carried in X-headers instead of the JSON result
– Outbound transformation still supported
– Not a preferred response mode
• text – returning either the native service response or a non-conformant JSON document
– Source and other metadata carried in X-headers instead of the JSON result
– Not a preferred response mode
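The out-of-band metadata idea can be sketched as follows; the header names are illustrative, not the actual AIP conventions:

```python
# Sketch of the "raw"/"text" response modes: provenance metadata
# travels in X-headers rather than in a JSON body, so the body can be
# passed through untouched. Header names are assumptions.
def raw_response(body, content_type, source_url):
    headers = {
        "Content-Type": content_type,
        "X-Source-Url": source_url,     # where the bytes came from
        "X-Response-Mode": "raw",       # flag the non-JSON mode
    }
    return headers, body
```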
53. araport.org
Data API Details (6)
• Transparent caching will compensate for transient remote service failures
• Automatic indexing of certain response types via ElasticSearch, allowing for sophisticated global search
– ElasticSearch allows us to index everything we “know about” and return it quickly
– iPlant uses it to live-index >700 TB of user data
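The transparent-caching idea, serving the last good response when a remote service hiccups, can be sketched as follows; the cache policy and names are illustrative, not the platform's actual implementation:

```python
import time

# Sketch: serve a fresh response when possible; fall back to the last
# cached copy on a transient remote failure. The max_age policy is an
# assumption for illustration.
_cache = {}  # url -> (timestamp, payload)


def cached_fetch(url, fetch, max_age=3600):
    try:
        payload = fetch(url)               # attempt the live call
        _cache[url] = (time.time(), payload)
        return payload
    except Exception:
        entry = _cache.get(url)
        if entry and time.time() - entry[0] < max_age:
            return entry[1]                # stale-but-usable fallback
        raise                              # no usable copy; surface the error
```

Because the fallback happens inside the API layer, clients see an ordinary response rather than an error when the upstream service is briefly down.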
54. araport.org
Developing an app
• Understand and document the user stories you’re addressing with your app
• Identify all requisite data sources AND
• Help us prepare them as Data APIs
– This may involve coding
• Understand the data integration or aggregation needs of your app
– This may involve coding
• Develop the user interface(s) for your app using our toolkits and suggested practices
– This will involve coding.
– But you will learn tools like jQuery, Bootstrap, & D3 and will thus be eminently employable!
Editor's Notes
Discuss the IAIC design process. Working groups, design workshop, Plant Cell whitepaper. Chris Town and I were selected to submit a proposal to realize this vision to NSF ABI. Funded Sep 1, 2013.
5 MINUTES
Upcoming release of the portal – simple, extensible
Developed in collaboration with Gos Micklem at Cambridge. They are funded by BBSRC to support this activity.
Users can search by gene locus, synonym, keyword, etc.
Shows gene locus, function, aliases, computational description and curator summary (from TAIR).
JBrowse (TAIR10 data), orthologs from other model organisms (plants will come), proteins from UniProt, GO from AmiGO
10 MINUTES
15 MINUTES
Proteomics Standards Initiative Common QUery InterfaCe (PSICQUIC)