Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.
Analyzing Big Data in R and Scala using Apache Spark (17-7-19), by Ahmed Elsayed
Data mining lets us make predictions about future data from historical data, especially Big Data, using machine learning algorithms running on two clusters: Hadoop, which manages the distributed file system for Big Data, and Apache Spark, which performs fast analysis of that data. To achieve this we will use R (via RStudio) or Scala (via Zeppelin).
Studies of HPCC Systems from Machine Learning Perspectives, by HPCC Systems
Ying Xie & Pooja Chenna, Kennesaw State University, present at the 2015 HPCC Systems Engineering Summit Community Day. Deep learning has emerged as a breakthrough in machine learning and has revitalized research in Artificial Intelligence (AI). Deep learning techniques are widely used in image recognition and natural language processing. In this presentation, we show our ECL implementation of an important deep neural network architecture, the Deep Belief Network. We further illustrate how to apply this implementation to network intrusion detection on massive data sets on HPCC Systems. Additionally, we report our study on how to optimally configure a stacked autoencoder, another deep learning architecture, on HPCC Systems for both supervised and unsupervised learning on different types of data sets.
Matching Data Intensive Applications and Hardware/Software Architectures, by Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
Modern applications, often called “big-data” analysis, require us to manage immense amounts of data quickly. To deal with applications such as these, a new software stack has evolved.
This presentation describes the company where I did my summer training, and covers what big data is, why we use it, big data challenges and issues, solutions to those issues, Hadoop, Docker, Ansible, etc.
BioTeam Bhanu Rekepalli Presentation at BICoB 2015, by The BioTeam Inc.
Adapting life sciences applications to next generation supercomputers with Intel Xeon Phi coprocessors to accelerate large scale data analysis and discovery.
Big Data raises challenges about how to process such a vast pool of raw data and how to extract value from it. To address these demands, an ecosystem of tools named Hadoop was conceived.
Top Hadoop Big Data Interview Questions and Answers for Freshers, by JanBask Training
Covers different types of big data benchmarking and the different benchmark suites, with details on TeraSort and a demo with TPCx-HS.
Meetup Details of presentation:
http://www.meetup.com/lspe-in/events/203918952/
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
I have studied Big Data analysis and found Hadoop to be the most capable and most popular technology for its distributed data processing approach. In this slide show I have gathered information about the various Hadoop distributions available in the market and describe the most important tools in the Hadoop ecosystem and their functionality. I also discuss connectivity with the R language from a data analysis and visualization perspective. I hope you enjoy it!
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
The Hadoop Ecosystem for developers session in DevGeekWeek in Israel.
This was a day-long session about big data problems and the Hadoop solution. We also talked about Spark and NoSQL.
Rapid Cluster Computing with Apache Spark 2016, by Zohar Elkayam
This is the presentation I used for Oracle Week 2016 session about Apache Spark.
In the agenda:
- The Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Working with Spark Cluster and Parallel programming
- Spark modules: Spark SQL and Spark Streaming
- Performance and Troubleshooting
We provide Hadoop training in Hyderabad and Bangalore, including corporate training, delivered by faculty with 12+ years of experience.
Real-time industry experts from MNCs
Resume preparation by expert professionals
Lab exercises
Interview preparation
Expert advice
This is a presentation on Apache Hadoop technology that may be helpful for beginners: it introduces Hadoop terminology and includes diagrams showing how the technology works.
Thank you.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to complement them. In this presentation you will hear what Big Data and a Data Lake are and which technologies are most popular in the Big Data world. We will also speak about Hadoop and Spark, how they integrate with traditional systems, and their benefits.
Teradata Partners Conference Oct 2014: Big Data Anti-Patterns, by Douglas Moore
Big Data Anti-Patterns: Lessons from the Front Lines
Drawn from over 50 client engagements, big data anti-patterns are common practices that make for bad solutions.
Similar to Hadoop for Bioinformatics: Building a Scalable Variant Store
Genomics Is Not Special: Towards Data Intensive Biology, by Uri Laserson
Genomics and the life sciences are using antiquated technology for processing data. As data volumes increase in the life sciences, many in the biology community are reinventing the wheel without realizing that a rich ecosystem of tools for processing large data sets already exists: Hadoop.
Description of the API concept for engineering and how it can be useful. Particularly how it should be used with respect to genomics data. Finally, an analogy of the API concept in synthetic biology and how evolution allows encapsulation.
Python in the Hadoop Ecosystem (Rock Health presentation), by Uri Laserson
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
5. Indexing the Web
- Web is huge: hundreds of millions of pages in 1999
- How do you index it?
  - Crawl all the pages
  - Rank pages based on relevance metrics
  - Build search index of keywords to pages
  - Do it in real time!
7. Databases in 1999
1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn't fail
5. Ambitious: buy another big machine as backup
11. Google does something different
- Designed their own storage and processing infrastructure: Google File System (GFS) and MapReduce (MR)
- Goals: KISS
  - Cheap
  - Scalable
  - Reliable
12. Google does something different
- It worked!
- Powered Google Search for many years
- General framework for large-scale batch computation tasks
- Still used internally at Google to this day
14. Birth of Hadoop at Yahoo!
- 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR.
- 2006: Spun out as Apache Hadoop
- Named after Doug's son's yellow stuffed elephant
15. Open-source proliferation
Google            Open-source (Hadoop)   Function
GFS               HDFS                   Distributed file system
MapReduce         MapReduce              Batch distributed data processing
Bigtable          HBase                  Distributed DB/key-value store
Protobuf/Stubby   Thrift or Avro         Data serialization/RPC
Pregel            Giraph                 Distributed graph processing
Dremel/F1         Cloudera Impala        Scalable interactive SQL (MPP)
FlumeJava         Crunch                 Abstracted data pipelines on Hadoop
17. HDFS design assumptions
- Based on Google File System
- Files are large (GBs to TBs)
- Failures are common
  - Massive scale means failures very likely
  - Disk, node, or network failures
- Accesses are large and sequential
- Files are append-only
20. MapReduce computation
- Structured as:
  1. Embarrassingly parallel "map stage"
  2. Cluster-wide distributed sort ("shuffle")
  3. Aggregation "reduce stage"
- Data-locality: process the data where it is stored
- Fault-tolerance: failed tasks automatically detected and restarted
- Schema-on-read: data need not be stored conforming to a rigid schema
(A toy sketch of the map/shuffle/reduce pattern follows below.)
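To make the three stages concrete, here is a toy, single-process sketch of the map / shuffle / reduce pattern in Python. It is illustrative only and does not use the Hadoop APIs; a real job would be written against Hadoop Streaming or the Java Mapper/Reducer interfaces.

# Toy single-process illustration of the map -> shuffle -> reduce data flow.
# This is NOT the Hadoop API; it only mimics the three stages described above.
from collections import defaultdict

def map_phase(record):
    # Embarrassingly parallel map stage: emit (key, value) pairs per input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # In Hadoop this is a cluster-wide distributed sort; here we just group by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregation reduce stage: one output record per key.
    return key, sum(values)

if __name__ == "__main__":
    records = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = (pair for record in records for pair in map_phase(record))
    for key, values in sorted(shuffle(pairs).items()):
        print(reduce_phase(key, values))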
30. Serialization/RPC formats
struct Observation {
  // can be general contig too
  1: required string chromosome,
  // python-style 0-based slicing
  2: required i64 start,
  3: required i64 end,
  // unique identifier for data set
  // (like UCSC genome browser track)
  4: required string track,
  // these are likely derived from the
  // track; separated for convenient join
  5: optional string experiment,
  6: optional string sample,
  // one of these should be non-null,
  // depending on the type of data
  7: optional string valueStr,
  8: optional i64 valueInt,
  9: optional double valueDouble
}
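For comparison, the same Observation record could be expressed as an Avro schema and serialized from Python. The sketch below uses the fastavro library and my own guess at a Thrift-to-Avro type mapping (i64 to long, optional fields as nullable unions); neither the library choice nor the mapping comes from the talk.

# Sketch (assumptions noted above): the Observation record as an Avro schema,
# written to a container file with fastavro.
from fastavro import parse_schema, writer

observation_schema = parse_schema({
    "name": "Observation",
    "type": "record",
    "fields": [
        {"name": "chromosome", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "track", "type": "string"},
        {"name": "experiment", "type": ["null", "string"], "default": None},
        {"name": "sample", "type": ["null", "string"], "default": None},
        {"name": "valueStr", "type": ["null", "string"], "default": None},
        {"name": "valueInt", "type": ["null", "long"], "default": None},
        {"name": "valueDouble", "type": ["null", "double"], "default": None},
    ],
})

records = [{
    "chromosome": "chr20", "start": 14369, "end": 14370, "track": "dbSNP",
    "experiment": None, "sample": None,
    "valueStr": "rs6054257", "valueInt": None, "valueDouble": None,
}]

with open("observations.avro", "wb") as out:
    writer(out, observation_schema, records)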
33. Parquet format advantages
- Columnar format
  - read fewer bytes
  - compression more efficient (incl. dictionary encodings)
- Thrift/Avro/Protobuf-compatible data model
  - Support for nested data structures
- Binary encodings
- Hadoop-friendly ("splittable"; implemented in Java)
- Predicate pushdown
- http://parquet.io/
(A small sketch of column pruning and predicate pushdown follows below.)
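As a small illustration of the columnar and predicate-pushdown points, here is a sketch using the pyarrow library (my choice for the example; the talk does not prescribe a client library). Column selection and the filters argument are the knobs that let the reader skip bytes.

# Sketch with pyarrow (library choice is an assumption): write a small table to
# Parquet, then read back only two columns with a simple pushed-down predicate.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "chromosome":  ["chr20", "chr20", "chr5"],
    "start":       [14370, 17330, 1110696],
    "track":       ["dbSNP", "study_A", "study_A"],
    "valueDouble": [0.5, 0.017, 0.333],
})

pq.write_table(table, "observations.parquet", compression="snappy")

# Column pruning: only the named columns are read and decoded.
# filters= pushes the predicate down so non-matching row groups can be skipped.
subset = pq.read_table(
    "observations.parquet",
    columns=["chromosome", "start"],
    filters=[("chromosome", "=", "chr20")],
)
print(subset)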
35. Core paradigm shifts with Hadoop
- Colocation of storage and compute
- Fault tolerance with cheap hardware
36. Benefits of Hadoop ecosystem
- Inexpensive commodity compute/storage
  - Tolerates random hardware failure
  - Decreased need for high-bandwidth network pipes
- Co-locate compute and storage
  - Exploit data locality
- Simple horizontal scalability by adding nodes
  - MapReduce jobs effectively guaranteed to scale
- Fault-tolerance/replication built-in. Data is durable
- Large ecosystem of tools
- Flexible data storage. Schema-on-read. Unstructured data.
38. HPC separates compute from storage
HPC is about compute. Hadoop is about data.
Compute cluster:
- High-performance hardware
- Low failure rate
- Expensive
Connected by a big network pipe ($$$) to the storage infrastructure:
- Proprietary, distributed file system
- Expensive
User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)
39. Hadoop colocates compute and storage
HPC is about compute. Hadoop is about data.
Compute cluster and storage infrastructure are the same nodes:
- Commodity hardware
- Data-locality
- Reduced networking needs
User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)
40. HPC is lower-level than Hadoop
- HPC only exposes job scheduling
- Parallelization typically occurs through MPI
  - Very low-level communication primitives
  - Difficult to horizontally scale by simply adding nodes
- Large data sets must be manually split
- Failures must be dealt with manually
- Hadoop has fault-tolerance, data locality, horizontal scalability
41. File system as DB; text file as LCD
- Broad joint caller with 25k genomes hits file handle limits
- Files streamed over network (HPC architecture)
- Large files split manually
- Sharing data/collaborating involves copying large files
42. Job scheduler as workflow tool
- Submitting jobs to a scheduler is very low level
- Workflow engines/execution models provide high-level execution graphs with fault-tolerance
  - e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive (a minimal Luigi sketch follows below)
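To give a flavor of the higher-level style, here is a minimal sketch using Luigi, one of the engines named above. The task names, file paths, and contents are invented for illustration; it is not the pipeline from the talk.

# Minimal Luigi sketch (task names and paths are hypothetical): the engine
# builds the dependency graph, tracks completed outputs, and reruns only what
# is missing, instead of hand-written scheduler submissions.
import luigi

class ExtractVariants(luigi.Task):
    sample = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("variants_%s.tsv" % self.sample)

    def run(self):
        with self.output().open("w") as out:
            out.write("chr20\t14370\trs6054257\n")  # stand-in for real extraction

class CountVariants(luigi.Task):
    sample = luigi.Parameter()

    def requires(self):
        return ExtractVariants(sample=self.sample)

    def output(self):
        return luigi.LocalTarget("variant_count_%s.txt" % self.sample)

    def run(self):
        with self.input().open() as inp, self.output().open("w") as out:
            out.write("%d\n" % sum(1 for _ in inp))

if __name__ == "__main__":
    luigi.build([CountVariants(sample="NA00001")], local_scheduler=True)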
43. Poor security/access models
- Deal with complex set of constraints from a variety of consents/redactions
  - Certain individuals redact certain parts of their genomes
  - Certain samples can only be used as controls for particular studies
  - Different research groups want to control access to the data they generate
  - Clinical trial data must have more rigorous access restrictions
44. Treating computation as free
- Many institutions make large clusters available for "free" to the average researcher
- Focus of dropping sequencing cost has been on biochemistry
45. Treating computation as free
- Stein, L. D. The case for cloud computing in genome informatics. Genome Biol (2010).
46. Treating computation as free
- Sboner et al. "The real cost of sequencing: higher than you think". Genome Biology (2011).
47. Lack of benchmarks for tracking progress
- Need to benchmark whether the quality of methods is improving
- http://www.nist.gov/mml/bbd/ppgenomeinabottle2.cfm
48. Lack of benchmarks for tracking progress
- Bradnam et al. "Assemblathon 2", Gigascience 2, 10 (2013).
49. Academic code
- Unreproducible, unbuildable, undocumented, unmaintained, unavailable, backward-incompatible, shitty code
- Most developers self-taught. Only one-third think formal training is important. [1, 2]
- "…people in my lab have requested code from authors and received source code with syntax errors in it" [3]
[1]: Haussler et al. "A Million Cancer Genome Warehouse" (2012)
[2]: Hannay et al. "How do scientists develop and use scientific software?" (2009)
[3]: http://ivory.idyll.org/blog/on-code-review-of-scientific-code.html
55. Move to Hadoop-style environment
- Data centralization on HDFS
- Data-local execution to avoid moving terabytes
- Higher-level execution engines to abstract away computations from details of execution
- Hadoop-friendly, evolvable serialization formats for:
  - Storage- and compute-efficiency
  - Abstracting data model from data storage details
- Built-in horizontal scalability and fault-tolerance
56. APIs instead of file formats
- Service-oriented architectures ensure stable contracts
- Allows for implementation changes with new technologies
- Software community has lots of experience with this type of architecture, along with mature tools.
- Can be implemented as language-independent.
(A minimal hypothetical sketch of such an API follows below.)
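To make the idea concrete, here is a minimal hypothetical HTTP endpoint written with Flask. The route, field names, and in-memory data are all invented for illustration; the point is only that clients code against a stable contract rather than a file format.

# Hypothetical sketch of "APIs instead of file formats": a tiny read-only
# variant endpoint. Route and field names are invented; the backing store is a
# stand-in for whatever engine sits behind the service.
from flask import Flask, jsonify, request

app = Flask(__name__)

VARIANTS = [
    {"chrom": "20", "pos": 14370, "id": "rs6054257", "ref": "G", "alt": "A"},
    {"chrom": "20", "pos": 17330, "id": None, "ref": "T", "alt": "A"},
]

@app.route("/variants")
def variants():
    # Clients depend on this contract, not on the storage behind it, so the
    # backend can change (VCF -> Parquet -> something else) without breaking them.
    chrom = request.args.get("chrom")
    start = int(request.args.get("start", 0))
    end = int(request.args.get("end", 2**31))
    hits = [v for v in VARIANTS if v["chrom"] == chrom and start <= v["pos"] < end]
    return jsonify(hits)

if __name__ == "__main__":
    app.run(port=8080)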
57. High-granularity access/common consent
1. Use technologies with highly-granular access control
   - e.g., Apache Accumulo, cell-based access control
2. Create common consents for patients to "donate" their data to research
   - e.g., Personal Genome Project, SAGE Portable Legal Consent, NCI "information donor"
58. Tools for open-source/reproducibility
- Software and computations should be open-sourced, e.g., on GitHub
- Release VMs or IPython notebooks with publications
  - "executable paper" to generate figures
  - Allow others to easily recompute all analyses
63. ADAM
- Defining alternative to BAM format that's Hadoop-friendly, splittable, designed for distributed computing
- Format built as Avro objects
- Data stored as Parquet format (columnar)
- Attempting to reimplement GATK pipeline to function on Hadoop/Parquet
- Currently run out of the AMPLab at UC Berkeley
65. Querying large, integrated variant data
- Biotech client has thousands of genomes
- Want to expose ad hoc querying functionality on a large scale
  - e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
- Integrating data with public data sets (e.g., ENCODE, UCSC tracks, dbSNP, etc.)
  - Terabyte-scale annotation sets
66. Conventional approaches: manual
- Manually parsing flat files
  - Write ad hoc scripts in perl or python
  - Build data structures in memory for histograms/aggregations
  - Custom script per query

counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try: counts_dict[chain.junction] += 1
    except KeyError: counts_dict[chain.junction] = 1
for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)
67. Conventional approaches: database
- Very feature rich and mature
  - Common analytical tasks (e.g., joins, group-by, etc.)
  - Access control
  - Very mature
- Scalability issues
- Indices can be prohibitive
- RDBMS: schemas can be annoyingly rigid
- NoSQL: adolescent implementations (but easy to start)
69. Hadoop sol'n: storage
- Impala/Hive metastore provide a unified, flexible data model
  - Define Avro types for all data
  - Data stored as Parquet format to maximize compression and query performance
70. Hadoop sol'n: available analytics engines
- Analytical operations implemented by experts in distributed systems
- Impala implements RDBMS-style operations
- Search offers metadata indexing
- Spark offers in-memory processing for ML
- HDFS-based analytical engines designed for horizontal scalability
72. Example schema
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
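Before the denormalization discussion on the following slides, here is a pure-Python sketch (simplified; not the production pipeline) that flattens the first VCF record above into one row per sample, roughly the wide shape the variant store queries against. Column names echo the SQL shown later but the parsing itself is my own illustration.

# Sketch (pure Python, simplified): flatten one VCF data line into one row per
# sample, the denormalized shape discussed on the following slides.
line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\t"
        "NS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t"
        "0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")
samples = ["NA00001", "NA00002", "NA00003"]

fields = line.split("\t")
chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]
fmt_keys = fmt.split(":")

rows = []
for sample, call in zip(samples, fields[9:]):
    row = {"vcf_chrom": chrom, "vcf_pos": int(pos), "id": var_id,
           "ref": ref, "alt": alt, "sample_id": sample}
    # Prefix genotype fields the way the warehouse columns do (vcf_call_gt, ...).
    row.update({"vcf_call_" + k.lower(): v
                for k, v in zip(fmt_keys, call.split(":"))})
    rows.append(row)

for r in rows:
    print(r)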
73. Example schema
(same VCF excerpt as slide 72, repeated)
74. Example schema
(same VCF excerpt as slide 72, repeated)
75. Why denormalization is good
- Replace joins with filters
  - For query engines with efficient scans, this simplifies queries and can improve performance
  - Parquet format supports predicate pushdowns, reducing necessary I/O
- Because storage is cheap, amortize cost of the up-front join over simpler queries going forward (a toy sketch of this trade follows below)
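A toy pandas sketch (illustrative only) of the trade described above: pay for the join once while building the wide table, then answer questions with plain filters.

# Toy pandas sketch: one up-front join produces the denormalized table; later
# queries are filters with no join at query time.
import pandas as pd

calls = pd.DataFrame({
    "snp_id":    ["rs6054257", "rs6040355", "rs6054257"],
    "sample_id": ["NA00001", "NA00002", "NA00003"],
    "genotype":  ["0|0", "1|2", "1/1"],
})
annotations = pd.DataFrame({
    "snp_id":   ["rs6054257", "rs6040355"],
    "chrom":    ["20", "20"],
    "in_dbsnp": [True, True],
})

# One-time, up-front join builds the wide table (amortized over many queries).
wide = calls.merge(annotations, on="snp_id", how="left")

# Subsequent queries are simple scans/filters over the wide table.
hits = wide[(wide["chrom"] == "20") & wide["in_dbsnp"]]
print(hits[["snp_id", "sample_id", "genotype"]])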
77. Example variant-filtering query
- "Give me all SNPs that are:
  - on chromosome 5
  - absent from dbSNP
  - present in COSMIC
  - observed in breast cancer samples
  - absent from prostate cancer samples"
- On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds
78. Example variant-filtering query
SELECT cosmic as snp_id,
       vcf_chrom as chr,
       vcf_pos as pos,
       sample_id as sample,
       vcf_call_gt as genotype,
       sample_affection as phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE COSMIC IS NOT NULL AND
      dbSNP IS NULL AND
      sample_study = "breast_cancer" AND
      VCF_CHROM = "16";
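One way to submit this query from Python is the impyla DB-API client, sketched below; the host name is a placeholder, the table name is copied from the slide, and the client choice is mine rather than something specified in the talk.

# Sketch: running the query above through the impyla DB-API client
# (host is a placeholder; 21050 is the usual Impala daemon HiveServer2 port).
from impala.dbapi import connect

QUERY = """
SELECT cosmic AS snp_id, vcf_chrom AS chr, vcf_pos AS pos,
       sample_id AS sample, vcf_call_gt AS genotype,
       sample_affection AS phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE cosmic IS NOT NULL
  AND dbsnp IS NULL
  AND sample_study = 'breast_cancer'
  AND vcf_chrom = '16'
"""

conn = connect(host="impala-daemon.example.org", port=21050)
cur = conn.cursor()
cur.execute(QUERY)
for row in cur.fetchall():
    print(row)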
79. Impala execution
- Query compiled into an execution tree, chopped up across all nodes (if possible)
- Two join implementations (a toy sketch follows below):
  1. Broadcast: each node gets a copy of the full right table
  2. Shuffle: both sides of the join are partitioned
- Partitioned tables vastly reduce the amount of I/O
- File formats make an enormous difference in query performance
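To illustrate the two join strategies named above, here is a toy pure-Python sketch; it mimics the idea only, not Impala's actual execution engine, joining call records to COSMIC annotations on snp_id.

# Toy sketch of broadcast vs. shuffle joins (conceptual; not Impala internals).
from collections import defaultdict

left = [("rs6054257", "NA00001"), ("rs6040355", "NA00002")]   # (snp_id, sample)
right = [("rs6054257", "COSM123"), ("rs999", "COSM456")]      # (snp_id, cosmic_id)

# Broadcast join: every node receives the full right table, builds a hash map,
# and probes it with its local slice of the left table.
right_by_key = defaultdict(list)
for key, cosmic in right:
    right_by_key[key].append(cosmic)
broadcast_result = [(key, sample, cosmic)
                    for key, sample in left
                    for cosmic in right_by_key.get(key, [])]

# Shuffle join: both sides are hash-partitioned on the join key so matching
# keys land on the same node; two "nodes" are simulated with buckets here.
def partition(rows, n=2):
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[0]) % n].append(row)
    return buckets

shuffle_result = []
for l_bucket, r_bucket in zip(partition(left), partition(right)):
    lookup = defaultdict(list)
    for key, cosmic in r_bucket:
        lookup[key].append(cosmic)
    shuffle_result.extend((key, sample, cosmic)
                          for key, sample in l_bucket
                          for cosmic in lookup.get(key, []))

print(broadcast_result)
print(shuffle_result)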
80. Other desirable query examples
- "How do the mutations in a given subject compare to the mutations in other phenotypically similar subjects?"
- "For a given gene, in what pathways and cancer subtypes is it involved?" (connecting phenotypes to annotations)
- "How common are an observed set of mutations?"
- "For a given type of cancer, what are the characteristic disruptions?"
81. Types of queries desired
- Lots of these queries can be simply translated into SQL queries
- Similar to functionality provided by PLINK/SEQ, but designed to scale to much larger data sets
82. All-vs-all eQTL
- Possible to generate trillions of hypothesis tests
  - 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
  - Tested below on 120 billion associations
- Example queries:
  - "Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)"
    - Finishes in several seconds
  - "Find all cis-eQTLs across the entire genome"
    - Finishes in a couple of minutes
    - Limited by disk throughput
83. All-vs-all eQTL
- "Find all SNPs that are:
  - in LD with some lead SNP or eQTL of interest
  - align with some functional annotation of interest"
- Still in testing, but likely finishes in seconds
- Schaub et al, Genome Research, 2012
84. Conclusions
- Hadoop ecosystem provides a centralized, scalable repository for data
- An abundance of tools for providing views/analytics into the data store
  - Separate implementation details from data pipelines
- Software quality/data structures/file formats matter
- Genomics has much to gain from moving away from HPC architecture toward Hadoop ecosystem architecture
85. Cloud-based implementation
- Hadoop-ecosystem architecture easily translates to the cloud (AWS, OpenStack)
- Provides elastic capacity; no large initial CAPEX
- Risk of vendor lock-in once data set is large
- Allows simple sharing of data via public S3 buckets, for example
86. Future work
- Broad Institute has experimented with Google's BigQuery for a variant store
  - BigQuery is Google's Dremel exposed to the public on Google's cloud
  - Closed-source, only Google cloud
- Developed API for working with variant data
- Soon develop Impala-backed implementation of Broad API
  - To be open-sourced
87. Future work
- Drive towards several large data warehouses; storage backend optimized for particular access patterns
- Each can expose one or more APIs for different applications/access levels.
- Haussler, D. et al. A Million Cancer Genome Warehouse. (2012). Tech Report.
Industry had at least some of these problems. Here’s how they solved them.
Already mature technologies at this point. DB community thought it was silly. Non-Google companies were not yet at this scale. Google not in the business of releasing infrastructure software. They sell ads.
Mostly through the Apache Software Foundation
Talk HDFS and MapReduce. Then some other tools.
Community is coalescing around HDFS
Large blocks. Blocks replicated around.
Two functions required.
Only need to supply 2 functions
Software: Cloudera Enterprise – The Platform for Big Data
- A complete data management solution powered by Apache Hadoop. A collection of open source projects forms the foundation of the platform; Cloudera has wrapped the open source core with additional software for system and data management as well as technical support.
5 Attributes of Cloudera Enterprise:
- Scalable: storage and compute in a single system – brings computation to data (rather than the other way around); scale capacity and performance linearly – just add nodes; proven at massive scale – tens of PB of data, millions of users.
- Flexible: store any type of data (structured, unstructured, semi-structured) in its native format – no conversion required, no loss of data fidelity due to ETL. Fluid structuring: no single model or schema that the data must conform to; determine how you want to look at data at the time you ask the question – if the attribute exists in the raw data, you can query against it; alter structure to optimize query performance as desired (not required) – multiple open source file formats like Avro and Parquet. Multiple forms of computation – bring different tools to bear on the data, depending on your skillset and what you want to do: batch processing (MapReduce, Hive, Pig, Java), interactive SQL (Impala, BI tools), interactive search (for non-technical users, or helping to identify datasets for further analysis), machine learning (apply algorithms to large datasets using libraries like Apache Mahout), math (tools like SAS and R for data scientists and statisticians), and more to come.
- Cost-Effective: scale out on inexpensive, industry-standard hardware (vs. highly tuned, specialized hardware); fault tolerance built in; leverage cost structures with existing vendors; reduced data movement – can perform more operations in a single place due to flexible tooling; fewer redundant copies of data; less time spent migrating/managing; open source software is easy to acquire and to prove the value/ROI.
- Open: rapid innovation; large development communities; the most talented engineers from across the world; easy to acquire and prove value – free to download and deploy, demonstrate the value of the technology before you make a large-scale investment; no vendor lock-in – choose your vendor based solely on merit. Cloudera's open source strategy: if it stores or processes data, it's open source; big commitment to open source; leading contributor to the Apache Hadoop ecosystem – defining the future of the platform together with the community.
- Integrated: works with all your existing investments – databases and data warehouses, analytics and BI solutions, ETL tools, platforms and operating systems, hardware and networking equipment; over 700 partners including all of the leaders in the market segments above; complements those investments by allowing you to align data and processes to the right solution.
Up until a couple years ago, Hadoop was just MapReduce.
Designed to do the task like BigQuery or Teradata
Interface description language
Compare with UCSC file formats:
Interface description language
Interface description language
Show more general version later: ADAM
I don’t mean to indict anyone in particular.
Not so much a sin, but 15 year old architecture.
If you have a gun-fight in the data center, your MapReduce job will still finish.
Log scale.
Define ETL
Trying to work within the Global Alliance, along with MSSM, Broad, others.
Once you have a VCF, what to do with it? True that the data has compressed massively, but it can still get large.
Data here is already denormalized. Kicked off the trajectory that landed me at Cloudera.
My first MapReduce job was written in MongoDB’s JavaScript aggregation engine.