The document provides an introduction to Hadoop concepts, including the core projects within Hadoop and how they fit together. It discusses common use cases for Hadoop across different industries and provides examples of how Hadoop can be used for tasks like social network analysis, content optimization, network analytics, and more. The document also summarizes key Hadoop concepts including HDFS, MapReduce, Pig, Hive, and HBase, and gives examples of how Hadoop can be applied in domains like financial services, science, energy and others.
Hadoop World 2011: Big Data Analytics – Data Professionals: The New Enterpris... – Cloudera, Inc.
This presentation will explore how Hadoop and Big Data are re-inventing enterprise workflows, and the pivotal role of the Data Analyst. It will examine the changing face of analytics and the streamlining of iterative queries through evolved user interfaces. The speaker will cut through hype around “shorter time to insight” and explain how combining Hadoop and SQL-based analytics helps companies discover emergent trends hidden in unstructured data, without having to retrain data miners or restaff. In particular, it will highlight changes to Big Data analysis from this paradigm and illustrate stepwise how analysts can now connect to Big Data platforms, assemble working data sets from disparate sources, analyze and mine that data for actionable insight, publish the results as visualizations and feeds for reporting tools, and operationalize Map-Reduce and Big Data outcomes into company workflows – all without touching the command line.
Presentation at the Bernadotte Academy in 2007 by Infosphere CEO Mats Björe about the concept of OSINT. Examples from tools like Silobreaker are included.
Kirti Vashee, Vice-President Enterprise Translation Sales, Asia Online
Rustin Gibbs, Solutions Architect, Moravia Worldwide
Kirti and Rustin provide insights into an innovative approach to the practical use of MT in situations where the bilingual data is of insufficient volume and the monolingual data is of unclear relevance. They provide examples from the travel and publishing industries to show the individual steps of the process, equipping participants with information on what language and language technology tools exist to build a high-quality translation engine.
Hadoop World 2011: Hadoop Trends & Predictions - Vanessa Alverez, Forrester – Cloudera, Inc.
Hadoop is making its way into the enterprise as organizations look to extract valuable information and intelligence from the mountains of data in their storage environments. The way in which this data is analyzed and stored is changing, and Hadoop has become a critical part of this transformation. In this session, Vanessa will cover the trends we are seeing in enterprise Hadoop adoption and how it is being used, as well as predictions on where we see Hadoop, and Big Data in general, going as we enter 2012.
EMC World 2012 : Hadoop has rapidly emerged as the preferred solution for big data analytics across unstructured data and companies are seeking competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data. This session reviews the practices of performing analytics using unstructured data with Hadoop.
First slide of Hadoop:
* Introduction to Big Data and Hadoop:
- Presenting and defining big data
- Introducing Hadoop and its history
- How Hadoop works
- HDFS
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
Having trouble distinguishing Big Data, Hadoop, and NoSQL, or seeing how they connect? This slide deck from the Savvycom team can definitely help you.
Enjoy reading!
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals – Cloudera, Inc.
The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices actually can be implemented in Hadoop.
The Big Data and Hadoop training course is designed to provide the knowledge and skills to become a successful Hadoop developer. The course covers in-depth concepts such as the Hadoop Distributed File System, setting up a Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc.
Hadoop is emerging as the preferred solution for big data analytics across unstructured data. Using real world examples learn how to achieve a competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data.
Controlling the beasts: tools for managing and monitoring distributed syst... – yaevents
Alexander Kozlov, Cloudera Inc.
Alexander Kozlov, a senior architect at Cloudera Inc., works with large companies, many of them in the Fortune 500, on projects building systems for analyzing large amounts of data. He completed graduate study at the Physics Department of Moscow State University and then also earned a Ph.D. at Stanford. After finishing his studies and before Cloudera, he worked on statistical data analysis and related computing technologies at SGI, Hewlett-Packard, and the startup Turn.
Talk topic
Controlling the beasts: tools from Cloudera for managing and monitoring distributed systems.
Abstract
Maintaining distributed systems consisting of thousands of computers is a difficult task. Cloudera, a company specializing in distributed technologies, has developed a set of tools for centralized management of distributed Hadoop/HBase clusters. Hadoop and HBase are Apache Software Foundation projects, and their adoption for analyzing semi-structured data is accelerating worldwide. This talk presents SCM, a system for configuring, tuning, and managing Hadoop/HBase, and Activity Monitor, a system for monitoring a range of OS and Hadoop/HBase metrics, and explains how Cloudera's approach differs from existing monitoring solutions (Tivoli, xCat, Ganglia, Nagios, etc.).
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc... – Cloudera, Inc.
"Amr Awadallah served as the VP of Engineering of Yahoo's Product Intelligence Engineering (PIE) team for a number of years. The PIE team was responsible for business intelligence and advanced data analytics across a number of Yahoo's key consumer-facing properties (search, mail, news, finance, sports, etc.). Amr will share the data architecture that PIE had implemented before Hadoop was deployed and the headaches that architecture entailed. He will then show how most, if not all, of these headaches were eliminated once Hadoop was deployed, and will illustrate how Hadoop and relational databases complement each other within the traditional business intelligence data stack, enabling organizations to access all their data under different operational and economic constraints."
Hadoop Summit 2012 | Integrating Hadoop Into the Enterprise – Cloudera, Inc.
The power of Hadoop lies in its ability to help users cost effectively analyze all kinds of data. We are now seeing the emergence of a new class of analytic applications that can only be enabled by a comprehensive big data platform. Such a platform extends the Hadoop framework with built-in analytics, robust developer tools, and the integration, reliability, and security capabilities that enterprises demand for complex, large scale analytics. In this session, we will share innovative analytics use cases from actual customer implementations using an enterprise-class big data analytics platform.
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012 – Jonathan Seidman
A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We... – Cloudera, Inc.
Hear from 451 analyst Matt Aslett, Cloudera CEO Mike Olson, and Cloudera customers RIM and YP (formerly AT&T Interactive) to learn:
» Why Cloudera customers have chosen CDH to get started with Hadoop
» The business value resulting from analyzing new data sources in new ways
» How Hadoop will change these Customers’ business and industry over the next 3-5 years
Hadoop World 2011: The Blind Men and the Elephant - Matthew Aslett - The 451 ... – Cloudera, Inc.
Who is contributing to the Hadoop ecosystem, what are they contributing, and why? Who are the vendors that are supplying Hadoop-related products and services and what do they want from Hadoop? How is the expanding ecosystem benefiting or damaging the Apache Hadoop project? What are the emerging alternatives to Hadoop and what chance do they have? In this session, the 451 Group will seek to answer these questions based on their latest research and present their perspective of where Hadoop fits in the total data management landscape.
2015 nov 27_thug_paytm_rt_ingest_brief_final – Adam Muise
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch-focused ingest system built on Sqoop to a streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform, including our feature-creation template.
Moving to a data-centric architecture: Toronto Data Unconference 2015 – Adam Muise
Why use a data lake? Why use lambda? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
Creating a data science team from an architect's perspective: how to support a data science team by building it with the right staff, including data engineers and DevOps.
An overview of securing Hadoop. Content primarily by Balaji Ganesan, one of the leaders of the Apache Argus project. Presented on Sept 4, 2014 at the Toronto Hadoop User Group by Adam Muise.
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture – Adam Muise
An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
A brief "What is Hadoop" intro for the Georgian Partners CTO Conference. It outlines the origins of open-source Apache Hadoop and how Hortonworks fits into this picture, and gives a brief introduction to YARN, the new resource-negotiation layer.
Sept 17 2013 - THUG - HBase a Technical Introduction – Adam Muise
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA Connect – Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... – BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... – UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Elevating Tactical DDD Patterns Through Object Calisthenics – Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Epistemic Interaction - tuning interfaces to provide information for AI support – Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
A tale of scale & speed: How the US Navy is enabling software delivery from l... – sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
PHP Frameworks: I want to break free (IPC Berlin 2024) – Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf – Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs – Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He brings around 20 years of solution-engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf – Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... – SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
31. Customer Risk Analysis
Build a comprehensive data picture of customer-side risk and publish a consolidated set of attributes for analysis:
- Map ratings across products
- Parse and aggregate data from different sources: credit and debit cards, product payments, deposits and savings, banking activity, browsing behavior, call logs, e-mails and chats
- Merge data into a single view with a “fuzzy join” among data sources
- Structure and normalize attributes: sentiment analysis, pattern recognition
Copyright 2010 Cloudera Inc. All rights reserved
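The “fuzzy join” and attribute-normalization steps above can be sketched in a few lines of Python. This is a minimal illustration only; the field names, records, and the crude normalization rule are hypothetical, not from the deck:

```python
# Sketch of a "fuzzy join": merge customer records from two sources
# into a single view, matching on a normalized name key.
def normalize_key(name):
    """Crude normalization: lowercase, keep only alphanumerics."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def fuzzy_join(cards, banking):
    """Merge card data and banking activity into one record per customer."""
    merged = {}
    for rec in cards:
        merged.setdefault(normalize_key(rec["name"]), {}).update(rec)
    for rec in banking:
        merged.setdefault(normalize_key(rec["name"]), {}).update(rec)
    return merged

cards = [{"name": "J. Smith", "card_balance": 1200}]
banking = [{"name": "j smith", "deposits": 5000}]
view = fuzzy_join(cards, banking)  # "J. Smith" and "j smith" collapse to one key
```

At cluster scale the same matching would run as a MapReduce join keyed on the normalized value.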
32. Surveillance and Fraud Detection
Trade surveillance records activity in a central repository:
- Centralized logging across all execution platforms
- Structured and raw log data from multiple applications
- Pattern recognition detects anomalies and harmful behavior
Feature sets and timeline vectors are very dynamic:
- Schema-on-read provides flexibility for analysis
- Data is primarily served and processed in HDFS with MapReduce
- Data filtering and projection in Pig and Hive
- Statistical modeling of data sets in R or SAS
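A toy version of the pattern-recognition step, flagging accounts whose activity deviates sharply from the norm, might look like the following. The accounts, counts, and threshold are invented for illustration; production surveillance uses far richer features:

```python
# Flag accounts whose daily trade counts sit far above the population mean,
# a toy stand-in for the deck's "pattern recognition" step.
def flag_anomalies(counts, threshold=1.5):
    """Return account ids whose count exceeds `threshold` std devs above the mean."""
    values = list(counts.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    if std == 0:
        return []
    return [acct for acct, v in counts.items() if (v - mean) / std > threshold]

daily_trades = {"A001": 10, "A002": 12, "A003": 11, "A004": 500}
suspects = flag_anomalies(daily_trades)  # A004 stands far above the rest
```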
33. Central Data Repository
Financial data is messy due to many interacting systems:
- Personal data is obfuscated for security, and records get out of sync
- Trades need to be “sessionized” into accounts and products
- Discrepancies are difficult to reconcile, so corrections need to be tracked
Hadoop is a centralized platform for data collection:
- Single source for data; processing happens on the platform
- Metadata is used to track the information lifecycle
- Workflows run and monitor data transformation pipelines
Data is served via APIs or in batch:
- Single version of the truth; data is processed and cleansed centrally
- Clear audit trail of data dependencies and usage
38. Genomics
The cost of DNA sequencing is falling very fast:
- Raw data needs to be aligned and matched
- Scientists want to collect and analyze these sequences
Hadoop can read the native format:
- hadoop-bam, a Java library for manipulation of Binary Alignment/Map files
- Alignment, SNP discovery, genotyping
Genomic tools based on Hadoop:
- SEAL: distributed short-read alignment
- BlastReduce: parallel read mapping
- Crossbow: whole-genome re-sequencing analysis
- Cloudburst: sensitive MapReduce alignment
39. Utilities and the Power Grid
The power grid is aging and maintained incrementally:
- Failures are hard to predict and can have cascading effects
- Looking at vibration of transformers over time to find patterns
Predicting failure of grid equipment:
- Supervised learning to scan time-series data for fuzzy patterns
- Identify likely faulting equipment for targeted replacement
Hadoop-based tools to model equipment behavior:
- openPDC project: http://openpdc.codeplex.com
- Lumberyard: indexing time-series data for low-latency fuzzy queries
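As a rough sketch of scanning time-series data for unusual patterns, the following compares each window of vibration readings against the series baseline. The window width, factor, and readings are made up for illustration and are far simpler than a trained model:

```python
# Toy scan of a vibration time series: flag windows whose mean amplitude
# exceeds a multiple of the overall baseline.
def unusual_windows(series, width=3, factor=2.0):
    """Return start indices of windows whose mean exceeds factor * baseline."""
    baseline = sum(series) / len(series)
    hits = []
    for i in range(len(series) - width + 1):
        window = series[i:i + width]
        if sum(window) / width > factor * baseline:
            hits.append(i)
    return hits

vibration = [1.0, 1.1, 0.9, 1.0, 6.0, 6.2, 6.1, 1.0]
alerts = unusual_windows(vibration)  # flags the elevated stretch
```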
40. Smart Meter Example Workflow
Looking at usage patterns in home smart meter data:
- How to educate consumers to save energy
- Capacity planning for the grid
Individual analysis is critical:
- Personalized reporting to consumers
- Predictive modeling of peak usage and potential cost savings
Hadoop for collection, reporting and analysis:
- Collect time-series samples in Hadoop
- Partition at various granularities and roll up reports and models
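The partition-and-roll-up step can be illustrated with plain Python dictionaries; at scale the same grouping would run as a MapReduce job. The timestamps and readings below are invented:

```python
# Roll up raw smart-meter samples (timestamp, kWh) to hourly and daily totals
# by bucketing on a timestamp prefix.
def rollup(samples, key_len):
    """Aggregate samples keyed by a timestamp prefix ('YYYY-MM-DD HH' etc.)."""
    totals = {}
    for ts, kwh in samples:
        bucket = ts[:key_len]
        totals[bucket] = totals.get(bucket, 0.0) + kwh
    return totals

samples = [
    ("2010-06-01 08:15", 0.4),
    ("2010-06-01 08:45", 0.6),
    ("2010-06-01 09:10", 0.5),
    ("2010-06-02 08:05", 0.7),
]
hourly = rollup(samples, key_len=13)  # buckets like "2010-06-01 08"
daily = rollup(samples, key_len=10)   # buckets like "2010-06-01"
```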
43. Processing Seismic Data
Optimize the IO-intensive phases of seismic processing
Incorporate additional parallelism where it makes sense
Simplify gather/transpose operations with MapReduce
Seismic Unix for Core Algorithms
Well-known, used at many grad programs in geophysics
SU file format can be easily transformed for processing on HDFS
Hadoop Streaming
Seismic Unix, SEPlib, Javaseis - non-Java code in MR
Framework is aware of parameter files needed by SU commands
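The "simplify gather/transpose operations with MapReduce" bullet is the key insight here: a seismic gather is just a group-by. A toy version (trace record layout invented; real jobs would shell out to Seismic Unix commands via Hadoop Streaming):

```python
# Toy MapReduce-style "gather": re-group seismic traces by common
# midpoint (CMP) so per-CMP processing can run in the reduce phase.

from collections import defaultdict

def gather_by_cmp(traces):
    """Group (source_x, receiver_x, samples) traces by midpoint."""
    gathers = defaultdict(list)
    for src, rcv, samples in traces:
        cmp_x = (src + rcv) / 2.0   # common midpoint coordinate
        gathers[cmp_x].append(samples)
    return dict(gathers)

traces = [
    (0.0, 100.0, [1, 2]),
    (20.0, 80.0, [3, 4]),   # same midpoint as the first trace
    (0.0, 200.0, [5, 6]),
]
gathers = gather_by_cmp(traces)
```

The shuffle phase does the expensive redistribution for free, which is exactly the IO-intensive step the slide says Hadoop optimizes.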
Copyright 2011 Cloudera Inc. All rights reserved
46. Brands and Sentiment Analysis
Internet generates a lot of chatter about brands
Understanding what’s being said is crucial to protecting brand value
Facebook, Twitter generate a lot of data for a global top brand
Capturing and Processing direct feedback
Better engagement and alerting via Sentiment Analysis
Not yet ready for fully automated customer service
Hadoop handles the diverse data types and processing
Sources of data changing and semantics continuously evolving
Sophistication of algorithms is improving daily
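As the slide notes, the algorithms are far more sophisticated than a word list, but the shape of a lexicon-based scorer — the simplest baseline — looks like this (word lists and threshold invented for illustration):

```python
# Deliberately simple lexicon-based sentiment scoring: count positive
# minus negative words, and route sufficiently negative posts to a
# human, since fully automated customer service isn't ready yet.

POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "broken", "awful"}

def sentiment(text):
    """Return (#positive - #negative) words for one post."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def alertworthy(posts, threshold=-1):
    """Posts negative enough to alert a human for engagement."""
    return [p for p in posts if sentiment(p) <= threshold]
```

In production the scoring runs as a MapReduce job over the captured Facebook/Twitter firehose, with the lexicons and models continuously retrained as semantics evolve.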
47. Point of Sale Transaction Analysis
Lots of machine generated data available
Line items, stock, coupons, ads
Stored in various formats
Pattern recognition enables constant reassessment
Optimizing across multiple data sources
Demand prediction based on joining multiple data sets for more insight
Retail Supply Chain
Weather and Financial data
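The "joining multiple data sets" bullets can be sketched as a simple hash join of line items against external data (all field names invented for illustration):

```python
# Sketch of enriching POS line items with external weather data to
# help explain demand; a reduce-side join would do this at scale.

def join_sales_weather(sales, weather):
    """Annotate each (date, sku, qty) sale with that day's weather."""
    by_date = {w["date"]: w for w in weather}
    return [
        {**s, "temp_f": by_date[s["date"]]["temp_f"]}
        for s in sales
        if s["date"] in by_date
    ]

sales = [{"date": "2010-07-04", "sku": "ICE-CREAM", "qty": 120}]
weather = [{"date": "2010-07-04", "temp_f": 95}]
joined = join_sales_weather(sales, weather)
```

The same pattern extends to supply-chain and financial feeds: each extra source is another lookup keyed on a shared attribute.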
56. Recommendations and Forecasting
Collect and serve personalization information
Wide variety of constantly changing data sources
Data guaranteed to be messy
Data ingestion includes collection of raw data
Filtering and fixing of poorly formatted data
Normalization and matching across data sources
Analysis looks for reliable attributes and groupings
Interpretation (e.g. gender by name)
Aggregation across likely matching identifiers
Identify possible predicted attributes or preferences
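The "interpretation (e.g. gender by name)" step above can be sketched as a lookup-based field fill (the lookup table here is a tiny stand-in for a real census-derived one; field names are invented):

```python
# Sketch of the interpretation step in recommendation-data ingestion:
# fill a missing attribute from a lookup, falling back to unknown.

NAME_GENDER = {"alice": "F", "bob": "M"}  # stand-in lookup table

def interpret(record):
    """Fill a missing gender field from the first name, if known."""
    fixed = dict(record)
    if not fixed.get("gender"):
        name = fixed.get("name", "")
        first = name.split()[0].lower() if name.split() else ""
        fixed["gender"] = NAME_GENDER.get(first, "U")
    return fixed
```

Real pipelines chain many such passes — normalization, fuzzy identifier matching, quality checks — and rerun them as the messy sources change.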
59. Why invest in training?
• Maximize your investment in a new technology
• Make fewer mistakes by learning the best practices
• Cheaper and easier to cross-train than hire
– Existing DBAs, Analysts and System Administrators can become Hadoop users
61. Value proposition of private training
• 12k/day for up to 20 students
– NEW: 8k/day for up to 10 students
– Price includes courseware, lab materials, cert vouchers (for Dev, Admin, HBase), and T&E
• We can tailor a class
– We have ~3 weeks of content that we can mix and match into a customized class
– Saves the customer’s time by covering the most relevant topics, cutting out non-essential material
• Customer chooses location and date
• We’re under NDA
Customers experience many pain points when leveraging this architecture for Big Data. Here are 3 of the most common.
Hadoop typically solves two types of problems: advanced analytics and data processing. These go by different terms in different industries. The applicability of these solutions is broad, and we’ve successfully deployed Hadoop and helped solve a diverse set of business problems.
FinSvc companies are realizing that they need to understand the fundamental risk in their customer base. All of a bank’s working capital originates with customers. Being able to better predict fluctuations can help them optimize how to put that capital to work.
FinSvc companies need to analyze trades both for regulatory requirements and for internal surveillance and fraud detection (internal and external). To date this has primarily involved looking at transactions and sampling data. Hadoop enables access to detailed data and non-transactional data.
FinSvc companies have many data sources and many consumers of data. Multiple data processing paths can lead to discrepancies in data as well as redundant work. A central repository manages all inbound data, takes requests for processing and delivers data sets. This makes the data reliable and traceable. FinSvc data is also messy and often needs to be updated or restated; a central location makes it easier to trace all the dependent data sets that need to be reprocessed.
Banking is becoming increasingly competitive, very similar to retail. It used to be that you banked with your local credit union for life. Now you have a different 401(k) with every employer, some 529s somewhere, checking, a mortgage, etc. Competitive pressure has driven down fees (despite recent complaints about new fees). Banks now need to compete on what they can offer on top of the ubiquitous financial products. Enter personalized asset management: merging financial models of market trends with personalized portfolios and goals. It’s embarrassingly parallel, and can be offered self-service or via a salesperson.
Assessing actual risk exposure in investments is incredibly complex. Multi-tiered instruments have lots of variables, and trends that cross the instruments have complex relationships. This is all well-structured data with intricate and fluid relationships. Add that trade volumes have skyrocketed, and this clearly becomes a Hadoop problem.
There are regulatory requirements for trade analytics (e.g. Reg NMS) that need to be audited. The margins on trades can be razor thin, and there’s value in analyzing trade performance. Trade execution platforms and algorithms are incredibly complicated. This is time series data, which looks a lot like clickstream data. Tracing particular trades through systems (in effect sessionizing them) and comparing to performance metrics is a classic Hadoop problem.
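The sessionization analogy in this note can be made concrete with a toy example (event format invented): group events by order id and sort by timestamp to reconstruct each trade's path through the systems, exactly as one would sessionize clickstream hits.

```python
# Toy trade "sessionization": reconstruct the path of each order
# through the trading systems from interleaved event logs.

from collections import defaultdict

def sessionize(events):
    """events: (order_id, ts, system) -> {order_id: [system, ...]}"""
    sessions = defaultdict(list)
    for order_id, ts, system in events:
        sessions[order_id].append((ts, system))
    return {oid: [s for _, s in sorted(evts)] for oid, evts in sessions.items()}

events = [
    ("o1", 3, "exchange"),
    ("o1", 1, "gateway"),
    ("o2", 2, "gateway"),
    ("o1", 2, "router"),
]
paths = sessionize(events)
```

In MapReduce, the order id becomes the shuffle key and each reducer sorts one order's events, so the reconstruction parallelizes across the full trade volume.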
There’s a yearly revolution in life sciences every time the cost of sequencing falls and the throughput doubles. The existing HPC systems can’t keep up with the amount of data. Hadoop allows scientists to combine data and processing into one scale-out grid, and there are already numerous libraries available to tackle these problems.
A big challenge in our electrical grid is that the infrastructure has grown incrementally over the past 100 years. We can’t replace it wholesale, both because of cost and risk. To prevent brownouts and blackouts caused by component failure, the TVA (responsible for the east coast electrical grid) is analyzing for patterns that can predict likely failure. This uses a combination of supervised learning and time series indexing to detect and analyze how components are behaving.
Smart meters are opening up a whole new world of data about how people consume electricity (vs. how it’s delivered). There are two particular focuses initially: one is to turn this data into education to help consumers be smarter about their electricity use; the other is to help with better capacity planning.
An area you might not consider as being on the cutting edge of technology is biodiversity indexing. One of the advantages of Hadoop is that it can store any kind of data in any format. It gives you the ability to cleanse that data repeatedly and turn it into well-defined structured data. If you need to adjust how you tackle that data, it’s always available in raw form. The final results can be served out of a traditional database or HBase.
We rely today on networks as much as we rely on electricity. This puts a heavy strain on the underlying network infrastructure. Closely monitoring those networks results in a flood of data (the largest network we’re aware of collects several hundred TB/day). Much of the monitoring is data exhaust: not fundamentally required to operate the network, but highly indicative of how it is functioning.
Seismic readings generate massive data volumes when mapping out the topology of the planet. These are typically collected on large storage farms, keeping only sampled or aggregated measurements. They’re then transferred to HPC grids to perform the complex model definition. Hadoop opens the door to using standard, well-known libraries in parallel and running them on the same grid that is storing the data. This reduces the need for sampling and significantly speeds up processing.
Companies have been able to analyze customer churn based on when other customers are leaving. Hadoop for the first time helps them capture the behaviors leading up to customer loss, to help predict when these events are likely. This gives companies more time to respond to possible customer loss. It involves traversing the social graph (customers rarely leave one at a time) and identifying and recognizing patterns that are leading indicators.
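The social-graph traversal this note describes can be sketched in miniature (graph and customer ids invented): flag customers directly connected to someone who already left, since churn tends to spread along relationships.

```python
# Toy churn leading-indicator scan: customers one hop away from a
# churned customer in the social graph are considered at risk.

def at_risk(graph, churned):
    """Customers directly connected to someone who already left."""
    return {c
            for leaver in churned
            for c in graph.get(leaver, ())
            if c not in churned}

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
risk = at_risk(graph, {"a"})
```

Real deployments extend this with multi-hop traversal and behavioral features mined from the event history, but the graph join itself is the MapReduce-friendly core.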
Much of the discussion about brands today happens in social media. This not only impacts a company’s perception but can have a direct influence on relationships with customers and the ability to sell. Hadoop is a natural solution for gathering and contextualizing discussions about company brands and products.
Point of sale analysis includes many different types of data today, from standard POS data to online, coupon-based and mixed. Companies need to track data from many different sources in different formats to understand their sales in depth. Hadoop can be used to better understand the supply chain, or to incorporate external data to explain sales behaviors.
It used to be that prices were set by region or season and updated periodically. Today pricing can be completely dynamic, especially for online retailers. Consumers are able to comparison shop with a few keystrokes, and customers also weigh the value of their purchase against time to delivery. Taking all these behaviors into account in a hyper-competitive market is complex. Hadoop is being used to tackle these challenges: new techniques are being applied to understand correlations and the effects of bundles and incentive discounts, and to cluster customers by a variety of attributes, not just as one type of consumer or another.
Customer loyalty used to be taken for granted, and loyalty programs were designed to help track customer purchases with finer granularity. Today customer loyalty is being used to bridge the gap between purchases. When customers can easily comparison shop, the incentive to stay with the same vendor is unclear. Loyalty programs are being designed not just to track or encourage customers to shop, but to build a relationship with the customer, so that the next time they shop, they prefer the brand that has been thinking of them and their needs. Loyalty programs can also be used to make timely offers: for example, when a customer is expected to run out of a particular product, provide a coupon that offers an upsell.
The Internet has expanded the world of offers from candy and magazines while you wait in the checkout line to anywhere and everywhere. Using modern ad networks, companies can track their customers after they’ve left their site. This opens up possibilities to re-capture customers who have not yet bought, or to cross-sell and upsell even after the transaction is complete. Customers use technologies such as HBase to incrementally monitor where customers are going. Algorithms can then be run on incremental data at a variety of time scales.
An online media group within a larger brand-name company has multiple separately branded and operated sites. Each has different systems for logs, including ad logs and ops logs, and different techniques for processing them. Hadoop provides a centralized platform for all of these properties to collect their system logs, ad logs and ops logs. Hadoop is also loaded with website feeds from 3rd party providers and operational metrics. This creates a standard platform for analytics and reporting. They’re soon turning on exploratory access and will provide centralized storage services for all properties.
A mobile ad platform measures standard metrics, but most of the data is arbitrary text since it can be defined by 3rd party developers. There are multiple SLAs for reporting to advertisers as well as for data accuracy. Log data is collected into HDFS and prepped, then loaded into HBase. HBase is used to serve results to advertisers in a similar fashion to general purpose online analytics services.
An online gaming vendor has multiple silos for each user interaction (registration, payments, game play, web interaction). The most popular games are very dynamic (simulating real world sports). The first goal is to give multiple business units access to all of the data. In particular, the game play metrics (telemetry data) are extremely detailed, similar to sensor data. The second goal is exploratory analysis, for example looking at distributions in game play behavior or for event triggers. A lot of the initial analysis is basic count-distinct on a wide variety of attributes and combinations of attributes, to look for correlated behaviors. Hadoop is also used to compute online statistics such as leaderboards.
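The count-distinct analysis mentioned in this note has a simple core (event fields invented for illustration): count distinct players per combination of attributes, which in MapReduce is one shuffle keyed on the attribute tuple.

```python
# Toy count-distinct over attribute combinations from game telemetry:
# how many distinct players performed each (game, action) combination.

from collections import defaultdict

def count_distinct_players(events):
    """Distinct players seen per (game, action) combination."""
    seen = defaultdict(set)
    for e in events:
        seen[(e["game"], e["action"])].add(e["player"])
    return {k: len(v) for k, v in seen.items()}

events = [
    {"game": "soccer", "action": "goal", "player": "p1"},
    {"game": "soccer", "action": "goal", "player": "p2"},
    {"game": "soccer", "action": "goal", "player": "p1"},
    {"game": "soccer", "action": "pass", "player": "p1"},
]
counts = count_distinct_players(events)
```

At telemetry scale the sets would not fit in memory per key, so real jobs either shuffle on (combination, player) and count keys, or use approximate sketches.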
Search quality is measured by users’ ability not only to find what they want, but to complete the transaction or take a next step. Understanding users’ goals is very difficult, and search trends vary over time. Fundamentally improving the service and assessing quality means logging everything into HDFS and rolling up your sleeves. This customer uses Hive mostly for aggregation and Sqoops the results into an RDBMS to publish to end users. Analytics have now become a critical part of the service (e.g. generating predictive search). Now they are focusing on where analytic needs are growing and what new data about searches the business wants to see.
Recommendation engines are popular applications on Hadoop. There are a wide variety of constantly changing sources, and the data is always messy. At data ingestion this requires filtering and fixing of poorly formatted data, and these processes constantly change as the data changes. Data is then normalized and matched across data sources. In some cases this means interpretation and filling in fields; in other cases it involves aggregation across fuzzy-matched identifiers. These also require quality checks.
Measuring influence on the Internet involves collecting a fire hose of data that includes opinions, references and links. Think of this as a very messy and very dynamic PageRank, but you’re ranking people and brands. Hadoop is used to prep all the data: identify metadata and distinct topics (which change). Hadoop is also used to score the social graph and filter out bots and spam. This is all tied together with Pig and Java and coordinated with Oozie. Data is then batch served in CSV and loaded into HBase to back an API.