Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?

Democratizing Data within your organization - Data Discovery

Mark Grover

Data Science in 2016: Moving Up

NoSQL Databases, Not just a Buzzword

Haitham El-Ghareeb

Strata sf - Amundsen presentation

Tao Feng

Beyond Kaggle: Solving Data Science Challenges at Scale

Turi, Inc.

Disrupting Data Discovery

Gephi, Graphx, and Giraph

Doug Needham

Data Discovery & Trust through Metadata

Data Discovery and Metadata

How Lyft Drives Data Discovery

Speaker: Philippe Mizrahi - Associate Product Manager - Lyft Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x. During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.

How Lyft Drives Data Discovery

Applied Machine learning using H2O, python and R Workshop

Avkash Chauhan

Note: Get all workshop content at https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Basic knowledge of R/Python and general ML concepts is assumed.
Note: This is a bring-your-own-laptop workshop. Bring your laptop to be able to participate.
Level: 200
Time: 2 hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding the linear regression model with an example
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo

Intake at AnacondaCon

Martin Durant

Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Mo Patel

Meetup SF - Amundsen

Philippe Mizrahi

Tds — big science dec 2021

Gérard Dupont

https://bigscience.huggingface.co/ Presentation of the BigScience project: an open research initiative launched by HuggingFace that aims to build a large language model (inspired by OpenAI and GPT-3) covering multiple languages and using a very large processing cluster. The participants plan to investigate the dataset and the model from all angles: bias, social and environmental impact, capabilities, limitations, ethics, potential improvements, specific domain performance, carbon impact, and the general AI/cognitive research landscape. The project also examines the overall impact of this type of approach when the goal is not only "to have a bigger model".

Spark Summit Europe: Share and analyse genomic data at scale

Andy Petrella

What's hot

NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...

South London Geek Nights

SF Python Meetup: TextRank in Python

Data Science with Spark - Training at SparkSummit (East)

Krishna Sankar

Optimizing Application Architecture (.NET/Java topics)

Ravi Okade

A New Year in Data Science: ML Unpaused

How Lyft Drives Data Discovery

How Lyft Drives Data Discovery

WU (Vienna University of Economics and Business)

Similar to Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?

Hadoop @ Sara & BiG Grid

Evert Lammerts

(Big) Data (Science) Skills

Oscar Corcho

Scaling the (evolving) web data –at low cost-

Vital AI: Big Data Modeling

Vital.AI

Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem, but the "Variety" problem is largely unaddressed: there is a lot of manual "data wrangling" to manage data models. These manual processes do not scale well. Not only is the variety of data increasing, but the rate of change in the data definitions is increasing too. We can’t keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it. This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world.
This talk covers:
- Creating Semantic Metadata Models of Big Data Resources
- Graphical UI Tools for Big Data Models
- Tools to synchronize Big Data Models and Application Code
- Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
- Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
- Using Big Data Models with Machine Learning to generate Predictive Models
- Developer Collaborative/Coordination processes using Big Data Models and Git
- Managing change: Big Data Models with rapidly changing Data Resources

AMP Camp 5 Intro

jeykottalam

eScience: A Transformed Scientific Method

Duncan Hull

Big Data, Beyond the Data Center

Gilles Fedak

Increasingly, the next scientific discoveries and the next innovative industrial breakthroughs will depend on the capacity to extract knowledge and sense from gigantic amounts of information. Examples vary from processing data provided by scientific instruments such as CERN’s LHC; collecting data from large-scale sensor networks; grabbing, indexing and nearly instantaneously mining and searching the Web; building and traversing billion-edge social network graphs; to anticipating market and customer trends through multiple channels of information. Collecting information from various sources, recognizing patterns and distilling insights constitutes what is called the Big Data challenge. However, as the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key challenge is to handle the complexity of data management on hybrid distributed infrastructures, i.e., assemblages of Clouds, Grids or Desktop Grids. In this talk, I will give an overview of our work in this research area, starting with BitDew, a middleware for large-scale data management on Clouds and Desktop Grids. Then I will present our approach to enabling MapReduce on Desktop Grids. Finally, I will present our latest results around Active Data, a programming model for managing the data life cycle on heterogeneous systems and infrastructures.

Towards a rebirth of data science (by Data Fellas)

Andy Petrella

Nowadays, Data Science is buzzing all over the place. But what is a so-called Data Scientist? Some will argue that a Data Scientist is a person able to report and present insights in a data set. Others will say that a Data Scientist can handle a high throughput of values and expose them in services. Yet another definition includes the capacity to create meaningful visualizations of the data. However, we are entering an age where velocity is key. Not only is the velocity of your data high, but the time to market is also shortened. Hence, the time separating the moment you receive a set of data from the moment you are able to deliver added value is crucial. In this talk, we’ll review the legacy Data Science methodologies and what they meant in terms of delivered work and results. Afterwards, we’ll move towards different concepts, techniques and tools that Data Scientists will have to learn and adopt in order to accomplish their tasks in the age of Big Data. The talk closes by presenting the Data Fellas view on a solution to these challenges, especially thanks to the Spark Notebook and the Shar3 product we develop.

Big Data Meetup #7

Paul Lo

BigData: My Learnings from data analytics at Uber
References (highly recommended):
* Designing Data-Intensive Applications http://bit.ly/big_data_architecture
* Big Data and Machine Learning using Python tools http://bit.ly/big_data_machine_learning
* Uber Engineering Blog http://eng.uber.com
* Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale http://bit.ly/hadoop_guide_bigdata

FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...

FIWARE

A Generic Scientific Data Model and Ontology for Representation of Chemical Data

Stuart Chalk

The current movement toward openness and sharing of data is likely to have a profound effect on the speed of scientific research and the complexity of questions we can answer. However, a fundamental problem with currently available datasets (and their metadata) is heterogeneity in terms of implementation, organization, and representation. To address this issue, we have developed a generic scientific data model (SDM) to organize and annotate raw and processed data, and the associated metadata. This paper will present the current status of the SDM, the implementation of the SDM in JSON-LD, and the associated scientific data model ontology (SDMO). Example usage of the SDM to store data from a variety of sources will be discussed, along with future plans for the work.

INF2190_W1_2016_public

Attila Barta

NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...

Paolo Nesi

Abstract: The recent growth of the World Wide Web, at an increasing rate and speed, and the number of online resources populating the Internet represent a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, which is largely represented in unstructured natural language formats. For automatically ingesting and processing such huge amounts of data, single-machine, non-distributed architectures are proving to be inefficient for tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity, and their computational power needs have increased significantly, requiring solutions such as distributed frameworks and parallel programming paradigms. This paper presents a distributed framework for executing NLP-related tasks in a parallel environment. This has been achieved by integrating the APIs of the widespread GATE open source NLP platform in a multi-node cluster, built upon the open source Apache Hadoop file system. The proposed framework has been evaluated against a real corpus of web pages and documents.

Share and analyze genomic data at scale by Andy Petrella and Xavier Tordoir

Spark Summit

GeoKettle: A powerful open source spatial ETL tool

Thierry Badard

DataHub

Aditya Parameswaran

Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015

Mark Wilkinson

The Dendro research data management platform: Applying ontologies to long-ter...

João Rocha da Silva

It has been shown that data management should start as early as possible in the research workflow to minimize the risks of data loss. Given the large numbers of datasets produced every day, curators may be unable to describe them all, so researchers should take an active part in the process. However, since they are not data management experts, they must be provided with user-friendly but powerful tools to capture the context information necessary for others to interpret and reuse their datasets. In this paper, we present Dendro, a fully ontology-based collaborative platform for research data management. Its graph data model innovates in the sense that it allows domain-specific lightweight ontologies to be used in resource description, acting as a staging area for later deposit in long-term preservation solutions.

More from Dataconomy Media

Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & David An...

Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...

The challenges of increasing complexity of organizations, companies and projects are obvious and omnipresent. Everywhere there are connections and dependencies that are often not adequately managed or not considered at all because of a lack of technology or expertise to uncover and leverage the relationships in data and information. In his presentation, Axel Morgner talks about graph technology and knowledge graphs as indispensable building blocks for successful companies.

Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...

Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...

Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo...

Compliance departments within banks and other financial institutions are turning to machine learning for improving their Anti Money Laundering compliance activities. Today, the systems that aim to detect potentially suspicious activity are commonly rule-based, and suffer from ultra-high false positive rates. DataRobot will discuss how their Automated Machine Learning platform was successfully used for a real use case to reduce their false positives and to enhance their Anti-Money Laundering activities.

Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...

Trump, Brexit, Cambridge Analytica... In the last few years, we have had to confront the consequences of the use and misuse of data science algorithms in manipulating public opinion through social media. The use of private data to microtarget individuals is a daily practice (and a trillion-dollar industry), which has serious side effects when the product being sold is your political ideology. How can we cope with this new scenario?

Data Natives Vienna v 7.0 | "Building Kubernetes Operators with KUDO for Dat...

Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...

Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...

What does it take to build a good data product or service? Data practitioners always think about the technology, user experience and commercial viability. But rarely do they think about the implications of the systems they build. This talk will shed light on the impact of AI systems and the unintended consequences of the use of data in different products. It will also discuss our role, as data practitioners, in planting the seeds of fairness in the systems we build.

Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...

Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...

Cloud infrastructure is a hostile environment: a power supply failure or a network outage leads to downtime and big losses. There is nothing we can trust: a single server, a server rack, even a whole datacenter can fail, and if an application is fragile by design, disruption is inevitable. We must distribute our application and diversify our cloud data strategy to survive disturbances of any scale. Apache Cassandra is a cloud-native, platform-agnostic database that stores data with distributed redundancy, so it easily survives any issue. Want to know how Apple and Netflix handle petabytes of data, keeping it highly available? Join us and listen to a story of 10 little servers and no downtime!

Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...

Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...

Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...

Data Natives Berlin v 20.0 | "Serving A/B experimentation platform end-to-end"...

Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...

Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...

Creativity is the mental ability to create new ideas and designs. Innovation, on the other hand, means developing useful solutions from new ideas. Creativity can be goal-oriented, whereas innovation is always goal-oriented. This means that innovation aims to achieve defined goals. The use of cloud services and technologies promises enterprise users many benefits in terms of more flexible use of IT resources and faster access to innovative solutions. That’s why, in this talk, we want to examine what role cloud computing plays in innovation within companies.

Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...

Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...

With most machine learning (ML) and deep learning (DL) frameworks, it can take hours to move data for ETL, and hours to train models. It's also hard to scale, with data sets increasingly being larger than the capacity of any single server. The amount of data also makes it hard to incrementally test and retrain models in near real-time. Learn how Apache Ignite and GridGain help to address limitations like ETL costs, scaling issues and time-to-market for new models, and help achieve near-real-time, continuous learning. Yuriy Babak, the head of ML/DL framework development at GridGain and an Apache Ignite committer, will explain how ML/DL work with Apache Ignite, and how to get started.
Topics include:
- Overview of distributed ML/DL including architecture, implementation, usage patterns, pros and cons
- Overview of Apache Ignite ML/DL, including built-in ML/DL algorithms, and how to implement your own
- Model inference with Apache Ignite, including how to train models with other libraries, like Apache Spark, and deploy them in Ignite
- How Apache Ignite and TensorFlow can be used together to build distributed DL model training and inference

Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...