The document discusses how Orbitz Worldwide integrated Hadoop into its enterprise data infrastructure to handle large volumes of web analytics and transactional data. Some key points:
- Orbitz used Hadoop to store and analyze large amounts of web log and behavioral data to improve services like hotel search. This let them analyze far more history than the roughly two weeks of data their previous archive retained.
- They faced initial resistance but built a Hadoop cluster with 200TB of storage to enable machine learning and analytics applications.
- The challenges now are providing analytics tools for non-technical users and further integrating Hadoop with their existing data warehouse.
The document discusses how Orbitz Worldwide uses Hadoop and big data to drive web analytics. It faces challenges with processing massive amounts of log data from millions of searches. Orbitz implemented a Hadoop infrastructure to provide long-term storage, access for developers and analysts, and rapid deployment of reporting applications. This allows Orbitz to aggregate data, run analysis jobs like traffic source mapping in minutes rather than hours, and generate over 25 million records per month. The implementation helps Orbitz shift analytics from innovation to mainstream use across business units.
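The summary above mentions aggregation jobs such as traffic source mapping. To make the shape of such a job concrete, here is a purely hypothetical sketch: the document shows no code, the field names (`referrer`, `date`) are invented, and PySpark is used for brevity even though work of that era would have run as Hadoop MapReduce.

```python
# Hypothetical traffic-source mapping job (not Orbitz's actual code).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("traffic-source-mapping").getOrCreate()

# Raw web logs landed on HDFS, one JSON record per page view.
# Assumes each record carries "date" and "referrer" fields.
logs = spark.read.json("hdfs:///data/weblogs/*")

# Map each referrer to a coarse traffic source.
source = (F.when(F.col("referrer").contains("google"), "search")
           .when(F.col("referrer").contains("facebook"), "social")
           .when(F.col("referrer") == "", "direct")
           .otherwise("other"))

daily = (logs.withColumn("source", source)
             .groupBy("date", "source")
             .count())

daily.write.mode("overwrite").parquet("hdfs:///reports/traffic_source_daily")
```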
Benchmarking Digital Readiness: Moving at the Speed of the Market (Apigee | Google Cloud)
This document discusses how companies can benchmark their digital readiness and move faster in the digital market. It finds that digital leaders who adopt apps, APIs, and data analytics outperform digital laggards. To move up, companies need business and technology leadership. They should think strategically about customer experience, operations, data, and innovation to access new revenue channels beyond direct monetization. Technologically, companies should take a "cloud first" and "outside in" approach to deliver fast, differentiated customer experiences through systems of engagement built on APIs and backends.
Modernizing Architecture for a Complete Data Strategy (Cloudera, Inc.)
The document outlines a presentation about modernizing data strategies. It discusses how companies' relationships with data are changing and the business drivers for adopting big data and analytics. It then provides guidance on building a modern data strategy, emphasizing the importance of people, process, and technology. Specifically, it recommends starting with high-impact use cases, staying agile, and evolving capabilities over time to maximize value from data. The presentation also covers how Hadoop is being used for different workloads in both on-premise and cloud environments.
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan... (Cloudera, Inc.)
Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
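As a rough sketch of the pattern this abstract describes (aggregate large data sets with Hadoop and Hive, then combine the result with relational data for business reports), the following hypothetical PySpark job runs a Hive-style aggregation and joins it to a warehouse dimension table over JDBC. All table names, columns, and the connection URL are invented for illustration; the talk's actual Hive scripts are not shown in the document.

```python
# Hypothetical sketch of the Hadoop/Hive-to-warehouse bridging pattern.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("edw-bridge")
         .enableHiveSupport()
         .getOrCreate())

# Aggregate searches per hotel from a Hive table over the raw logs.
agg = spark.sql("""
    SELECT hotel_id, COUNT(*) AS searches
    FROM raw_hotel_search_logs
    GROUP BY hotel_id
""")

# Dimension data from the existing relational data warehouse.
hotels = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dw-host:5432/edw")  # assumed URL
          .option("dbtable", "dim_hotel")
          .option("user", "report_user")
          .option("password", "...")
          .load())

# Combine the two worlds into a report business users can consume.
# Assumes dim_hotel carries hotel_name and city, and that a "reports"
# database already exists on the Hive side.
report = agg.join(hotels, "hotel_id").select("hotel_name", "city", "searches")
report.write.mode("overwrite").saveAsTable("reports.hotel_search_summary")
```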
A modern, flexible approach to Hadoop implementation incorporating innovation... (DataWorks Summit)
A modern, flexible approach to Hadoop implementation incorporating innovations from HP Haven
Jeff Veis, Vice President, HP Software Big Data
Gilles Noisette, Master Solution Architect, HP EMEA Big Data CoE
5 Myths about Spark and Big Data by Nik Rouda (Spark Summit)
The document discusses 5 myths about Spark and big data, and where big data is going next. It provides data from a survey of 475 IT and business professionals on their big data strategies, priorities, and technology evaluations. Key findings include that both business and IT stakeholders are initiating big data projects, with business value expected within 1 year for most. Important evaluation criteria for big data solutions center on traditional enterprise requirements like security, performance and cost rather than just the "5 Vs". Many organizations express interest in Spark for its machine learning and SQL capabilities.
This presentation will discuss the stories of three companies spanning different industries: what challenges they faced and how cloud analytics solved them, what technologies were implemented to address those challenges, and how they benefited from their new cloud analytics environments.
The objectives of this session include:
• Detail and explain the key benefits and advantages of moving BI and analytics workloads to the cloud, and why companies shouldn’t wait any longer to make their move.
• Compare the different analytics cloud options companies have, and the pros and cons of each.
• Describe some of the challenges companies may face when moving their analytics to the cloud, and what they need to prepare for.
• Provide the case studies of three companies, what issues they were solving for, what technologies they implemented and why, and how they benefited from their new solutions.
• Learn what to look for when considering a partner and trusted advisor to assist with an analytics cloud migration.
This document provides tips for successfully implementing Hadoop. It recommends starting with a well-defined use case using a small team on the cloud to test feasibility. It also stresses the importance of skills training given Hadoop's complexity, and adjusting processes and solutions based on data freshness needs rather than pursuing real-time analytics. Democratizing data access through self-service tools is also highlighted to maximize insights from Hadoop implementations.
Rethink Analytics with an Enterprise Data Hub (Cloudera, Inc.)
Have you run into one or more of the following barriers or limitations with your existing data warehousing architecture:
> Increasingly high data storage and/or processing costs?
> Silos of data sources?
> Complexity of management and security?
> Lack of analytics agility?
The document provides tips for successfully implementing Hadoop. It recommends starting with a well-defined use case using a small, skilled team on the cloud to quantify costs and benefits before a full rollout. It also stresses the importance of data quality, democratizing access to data through self-service tools, and focusing on insights rather than just real-time data. Training and skills development are critical as Hadoop technology and tools continue to evolve rapidly.
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri (Spark Summit)
Spark adoption in enterprises is unstoppable for five key reasons:
1. Enterprises prioritize improved customer experience, which requires real-time analysis of vast amounts of customer data.
2. Spark and Hadoop provide a cost-effective way to perform advanced analytics at scale on both internal and external data sources.
3. Spark allows for real-time analysis of data through its in-memory computing capabilities, unlocking the value of perishable insights (see the streaming sketch after this list).
4. Massive machine learning automation enabled by Spark is needed to automate the data science process and gain competitive advantages.
5. The diverse and innovative Spark community is helping drive the development of continuous analytics pipelines required for real-time insights.
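To make point 3 concrete, here is a minimal Spark Structured Streaming sketch that maintains running counts over events as they arrive. The socket source is a toy assumption for demonstration only; a production job would typically read from Kafka or a similar message bus.

```python
# Minimal streaming-counts sketch: count occurrences of each distinct
# line as it arrives, keeping state in memory between micro-batches.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-counts").getOrCreate()

# One event per line arriving on a local socket (assumed test setup,
# e.g. fed by `nc -lk 9999`).
events = (spark.readStream.format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load())

counts = events.groupBy("value").count()  # running count per distinct line

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```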
This document discusses how big data is transforming business intelligence. It outlines some of the pains of traditional BI, including maintaining large data warehouses and only considering structured data. The document advocates for an open source approach using Hadoop as an "extended data warehouse" to address these issues. Examples of recent Solocal Group projects involving real-time business analytics and a search power selector are provided. Advice is given on how companies can activate big data projects and start the BI transformation.
Hooduku Inc claims to provide solutions for big data needs through refined SaaS offerings. They have over 50 years of combined experience building enterprise systems and products for companies like Amazon and Microsoft. Hooduku can help clients implement best-in-class big data solutions to improve customer reach, reduce costs, and strengthen inventory management. They specialize in solutions that provide measurable benefits immediately and substantial returns over 3-6 months.
Hooduku Inc claims to provide solutions for big data needs through implementing tangible technical and business solutions. They have over 50 years of combined experience building enterprise systems for companies like Amazon and Microsoft. They can help customers configure and deploy technologies like Hadoop and Pig to analyze large datasets and provide actionable insights. They provide a case study of implementing a big data solution for a Houston-based natural gas company to acquire pipeline data through SCADA systems and analyze it with Hadoop and SQL Server. They describe how big data analytics could help a retailer like Acme Inc. predict hot products, customer demand patterns, and offer personalized promotions to drive sales.
PASS Summit Data Storytelling with R, Power BI and AzureML (Jen Stirrup)
How can we use technology to help the organization make data-driven decision-making part of its organizational DNA, while retaining the context of the business as a whole? How can we imprint data in the culture of the organization and make it easily accessible to everyone? Microsoft directly empowers businesses to derive insights and value from little and big data, through its release of user-friendly analytics through Azure Machine Learning (ML) combined with its acquisition of Revolution Analytics. Power BI can be used to create compelling visual stories around the analysis so that the work is not left to the data consumer. Together, these technologies can be used to make data and analytics part of the organization's DNA.
There are no prerequisites, but attendees are welcome to follow along with the demo if they have an Azure ML and Power BI account and R installed. Files will be released before the session.
2020 Big Data & Analytics Maturity Survey Results (AtScale)
The survey collected responses from over 150 Big Data & Analytics leaders and found that:
1) Most enterprises are adopting a hybrid/multi-cloud strategy rather than a single vendor.
2) Investment in Hadoop is staying the same or increasing for most respondents.
3) Many companies plan to invest in data virtualization which allows data to be accessed consistently across platforms.
4) Data governance was cited as a top challenge across all respondents.
Traditional BI vs. Business Data Lake – A Comparison (Capgemini)
Traditional BI systems have limitations in handling big data as they are not designed for unstructured data and have data latency issues. A business data lake provides a new approach by storing all raw structured and unstructured data in a single environment at low cost. This allows for near real-time analysis on any data from any source to gain insights.
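A minimal sketch of the data-lake pattern just described: raw records are landed cheaply and untouched, and structure is applied only at read time ("schema on read"). Paths, formats, and field names below are assumptions for illustration.

```python
# Schema-on-read sketch over a raw data-lake zone (illustrative paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-demo").getOrCreate()

# Raw zone: events were landed as-is (JSON lines), with no upfront modeling.
events = spark.read.json("hdfs:///lake/raw/events/*")

# Impose structure only when a question demands it.
(events.filter("event_type = 'booking'")
       .groupBy("channel")
       .count()
       .show())
```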
1. The document discusses Big Data analytics using Hadoop. It defines Big Data and explains the 3Vs of Big Data - volume, velocity, and variety.
2. It then describes Hadoop, an open-source framework for distributed storage and processing of large data sets across clusters of commodity hardware. Hadoop uses HDFS for storage and MapReduce for distributed processing (see the word-count sketch after this list).
3. The core components of Hadoop are the NameNode, which manages file system metadata, and DataNodes, which store data blocks. It explains the write and read operations in HDFS.
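As a concrete illustration of the MapReduce model named in point 2, here is the classic word count written for Hadoop Streaming, which lets mappers and reducers be plain scripts reading stdin and writing stdout. This is a generic textbook sketch, not code from the document.

```python
# mapper.py - Hadoop Streaming mapper: emit (word, 1) for every word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py - Hadoop Streaming reducer: input arrives sorted by key,
# so counts for the same word are adjacent and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A typical launch (the streaming jar path varies by distribution) looks like: hadoop jar hadoop-streaming.jar -input /logs -output /wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py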
Analytics in a Day Ft. Synapse Virtual Workshop (CCG)
Say goodbye to data silos! Analytics in a Day will simplify and accelerate your journey towards the modern data warehouse. Join CCG and Microsoft for a half-day virtual workshop, hosted by James McAuliffe.
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote (Caserta)
The “Big Data era” has ushered in an avalanche of new technologies and approaches for delivering information and insights to business users. What is the role of the cloud in your analytical environment? How can you make your migration as seamless as possible? This closing keynote, delivered by Joe Caserta, a prominent consultant who has helped many global enterprises adopt Big Data, provided the audience with the inside scoop needed to supplement data warehousing environments with data intelligence—the amalgamation of Big Data and business intelligence.
This presentation was given as the closing keynote at DBTA's annual Data Summit in NYC.
Reinventing the Modern Information Pipeline: Paxata and MapR (Lilia Gutnik)
(Presented at MapR's Big Data Everywhere event in Redwood City, CA in December 2016)
The relationship between business teams and IT has changed as the complexity of data has increased. A traditional data pipeline designed for an IT-centered approach to information management is not designed for the data demands of today's business decisions. Designing a big data strategy requires modernizing previous approaches. Self-service data preparation in a collaborative, intuitive, governed, and secure environment is the key to a nimble and decisive business unit.
DataOps: Nine steps to transform your data science impact, Strata London May 18 (Harvinder Atwal)
According to Forrester Research, only 22% of companies are currently seeing a significant return from data science expenditures. Most data science implementations are high-cost IT projects, local applications that are not built to scale for production workflows, or laptop decision support projects that never impact customers. Despite this high failure rate, we keep hearing the same mantra and solutions over and over again. Everybody talks about how to create models, but not many people talk about getting them into production where they can impact customers.
Harvinder Atwal offers an entertaining and practical introduction to DataOps, a new and independent approach to delivering data science value at scale, used at companies like Facebook, Uber, LinkedIn, Twitter, and eBay. The key to adding value through DataOps is to adapt and borrow principles from Agile, Lean, and DevOps. However, DataOps is not just about shipping working machine learning models; it starts with better alignment of data science with the rest of the organization and its goals. Harvinder shares experience-based solutions for increasing your velocity of value creation, including Agile prioritization and collaboration, new operational processes for an end-to-end data lifecycle, developer principles for data scientists, cloud solution architectures to reduce data friction, self-service tools giving data scientists freedom from bottlenecks, and more. The DataOps methodology will enable you to eliminate daily barriers, putting your data scientists in control of delivering ever-faster cutting-edge innovation for your organization and customers.
BreizhJUG - January 2014 - Big Data - Dataiku - Pages Jaunes (Dataiku)
This document provides an overview of big data and various big data tools including Pig, Hive, and Cascading. It discusses the history and motivation for each tool, how they work by mapping operations to MapReduce jobs, and compares key aspects of their data models, typing, and procedural vs declarative styles. The document is intended as a training presentation on these popular big data frameworks.
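The procedural-versus-declarative contrast drawn above can be shown in a few lines. Since the document includes no Pig or Cascading code, the sketch below uses PySpark as a stand-in: the same aggregation is expressed declaratively (SQL, the Hive style) and procedurally (a step-by-step dataflow, the Pig/Cascading style).

```python
# Declarative vs. procedural styles over the same toy data (illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("styles").getOrCreate()
logs = spark.createDataFrame(
    [("us", 3), ("fr", 1), ("us", 2)], ["country", "clicks"])
logs.createOrReplaceTempView("logs")

# Declarative, Hive/SQL style: say *what* you want.
declarative = spark.sql(
    "SELECT country, SUM(clicks) AS clicks FROM logs GROUP BY country")

# Procedural, Pig/Cascading style: build the dataflow step by step.
procedural = (logs.groupBy("country")
                  .agg(F.sum("clicks").alias("clicks")))

declarative.show()
procedural.show()
```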
Unlocking data science in the enterprise - with Oracle and Cloudera (Cloudera, Inc.)
This document discusses unlocking data science in the enterprise with Cloudera Data Science Workbench. It introduces Cloudera Data Science Workbench as a tool that accelerates data science from development to production. It allows data scientists to use R, Python, or Scala from a web browser to directly access and analyze data stored in Hadoop clusters. Cloudera Data Science Workbench provides secure, self-service environments for data scientists while also giving IT control over security and compliance. The document includes a demo of Cloudera Data Science Workbench's features.
My slides on how to use cloud as a data platform at BigDataWeek 2013 Romania
http://www.eurocloud.ro/en/events/all-there-is-to-know-about-big-data/#.UXZFaUDvlVI
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop (Cloudera, Inc.)
This document discusses how Orbitz is using Hadoop to store and process large amounts of log and transaction data in a scalable and cost-effective way. It outlines how Hadoop enables applications like recommendations, page performance tracking, and user segmentation. The goal is to integrate Hadoop with their existing enterprise data warehouse to provide a unified view of data and leverage existing business intelligence tools. Examples of processing pipelines and use cases for web analytics, beta, and click data are provided.
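The kind of click-data processing pipeline mentioned above can be sketched briefly. The job below is hypothetical (the document shows no code; the field names and the bot filter are invented): it cleans raw click logs and writes them partitioned by day so downstream warehouse loads and BI extracts only read what they need.

```python
# Hypothetical click-data cleaning pipeline (illustrative fields/paths).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("click-pipeline").getOrCreate()

clicks = spark.read.json("hdfs:///raw/clicks/*")

cleaned = (clicks
           # Drop obvious robot traffic before it skews analytics.
           .filter(~F.col("user_agent").rlike("(?i)bot|spider|crawler"))
           .withColumn("day", F.to_date("timestamp"))
           .select("day", "session_id", "page", "hotel_id"))

# Partition by day so downstream consumers scan only relevant data.
cleaned.write.mode("append").partitionBy("day").json("hdfs:///clean/clicks")
```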
This document provides an overview of Hadoop and big data use cases. It discusses the evolution of business analytics and data processing, as well as the architecture of traditional RDBMS systems compared to Hadoop. Examples of how companies have used Hadoop include a bank improving risk modeling by combining customer data, a telecom reducing churn by analyzing call logs, and a retailer targeting promotions by analyzing point-of-sale transactions. Hadoop allows these companies to gain valuable business insights from large and diverse data sources.
This document discusses big data workflows. It begins by defining big data and workflows, noting that workflows are task-oriented processes for decision making. Big data workflows require many servers to run one application, unlike traditional IT workflows which run on one server. The document then covers the 5Vs and 1C characteristics of big data: volume, velocity, variety, variability, veracity, and complexity. It lists software tools for big data platforms, business analytics, databases, data mining, and programming. Challenges of big data are also discussed: dealing with size and variety of data, scalability, analysis, and management issues. Major application areas are listed in private sector domains like retail, banking, manufacturing, and government.
Are you confused by Big Data? Get in touch with this new "black gold" and familiarize yourself with undiscovered insights through our complimentary introductory lesson on Big Data and Hadoop!
This document discusses big data and Hadoop. It defines big data as high volume data that cannot be easily stored or analyzed with traditional methods. Hadoop is an open-source software framework that can store and process large data sets across clusters of commodity hardware. It has two main components - HDFS for storage and MapReduce for distributed processing. HDFS stores data across clusters and replicates it for fault tolerance, while MapReduce allows data to be mapped and reduced for analysis.
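As a small, concrete companion to the HDFS description above: files are landed into HDFS, block replication can be adjusted, and block placement can be inspected, all from the standard hdfs CLI. The paths below are illustrative, and this assumes a configured Hadoop client on the PATH.

```python
# Driving the standard hdfs CLI from Python (illustrative paths).
import subprocess

# Land a local log file in HDFS (it is split into blocks on write).
subprocess.run(["hdfs", "dfs", "-put", "weblog.json", "/data/weblogs/"],
               check=True)

# Ask for 3 replicas of each block for fault tolerance.
subprocess.run(["hdfs", "dfs", "-setrep", "3", "/data/weblogs/weblog.json"],
               check=True)

# fsck reports each block and which DataNodes hold its replicas.
subprocess.run(["hdfs", "fsck", "/data/weblogs/weblog.json",
                "-files", "-blocks", "-locations"], check=True)
```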
As a follow-on to the presentation "Building an Effective Data Warehouse Architecture", this presentation will explain exactly what Big Data is and its benefits, including use cases. We will discuss how Hadoop, the cloud and massively parallel processing (MPP) is changing the way data warehouses are being built. We will talk about hybrid architectures that combine on-premise data with data in the cloud as well as relational data and non-relational (unstructured) data. We will look at the benefits of MPP over SMP and how to integrate data from Internet of Things (IoT) devices. You will learn what a modern data warehouse should look like and how the role of a Data Lake and Hadoop fit in. In the end you will have guidance on the best solution for your data warehouse going forward.
The document discusses how big data analytics can transform the travel and transportation industry. It notes that these industries generate huge amounts of structured and unstructured data from various sources that can provide insights if analyzed properly. Hadoop is one tool that can help manage and process large datasets in parallel across clusters of servers. The document discusses how sensors in vehicles and infrastructure can provide real-time data on performance, maintenance needs, inventory levels, and more. This data, combined with analytics, can help optimize operations, improve customer experiences, predict issues, and increase efficiency across the transportation sector. It emphasizes that companies must develop data science skills and implement new technologies to fully leverage big data for strategic advantage.
Every second of every day, electronic systems create ever-increasing quantities of data. Systems in markets such as finance, media, healthcare, government and scientific research feature strongly in the Big Data processing conversation, and extracting business value from Big Data is forecast to bring customer and competitive benefits. In this session, hear Vas Kapsalis, NetApp Big Data Business Development Manager, discuss his views and experience on the wider world of Big Data.
The document discusses big data analysis and provides an introduction to key concepts. It is divided into three parts: Part 1 introduces big data and Hadoop, the open-source software framework for storing and processing large datasets. Part 2 provides a very quick introduction to understanding data and analyzing data, intended for those new to the topic. Part 3 discusses concepts and references to use cases for big data analysis in the airline industry, intended for more advanced readers. The document aims to familiarize business and management users with big data analysis terms and thinking processes for formulating analytical questions to address business problems.
Collecting, processing, and analyzing large amounts of structured and unstructured data remains a challenge for many companies. Hadoop provides an open-source framework for distributed storage and processing of large datasets across commodity servers to help companies gain insights from big data. While Hadoop is commonly used, Spark is becoming a more popular tool: it can run up to 100 times faster for iterative jobs and integrates with SQL, machine learning, and streaming technologies. Both Hadoop and Spark often rely on the Hadoop Distributed File System for storage, and the two are commonly deployed together in big data projects and platforms from major vendors.
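The "100 times faster for iterative jobs" claim comes from keeping the working set in memory between passes instead of re-reading it from disk, as a chain of MapReduce jobs would. The toy gradient-descent loop below shows the idiom; the data layout (a Parquet file of x, y pairs) is invented for illustration.

```python
# Iterative job sketch: cache once, iterate many times in memory.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iterative").getOrCreate()

points = spark.read.parquet("hdfs:///data/points")  # assumed columns: x, y
points.cache()  # keep the working set in executor memory across iterations

# Fit y ~ w*x by gradient descent on (1/2)*avg((w*x - y)^2);
# the gradient with respect to w is avg((w*x - y) * x).
w = 0.0
for _ in range(10):
    # Each pass reuses the cached data instead of re-scanning HDFS.
    grad = points.select(
        F.avg((F.col("x") * w - F.col("y")) * F.col("x")).alias("g")
    ).first()["g"]
    w -= 0.1 * grad
print(f"fitted weight: {w}")
```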
This document provides an overview of big data and how to start a career working with big data. It discusses the growth of data from various sources and challenges of dealing with large, unstructured data. Common data types and measurement units are defined. Hadoop is introduced as an open-source framework for storing and processing big data across clusters of computers. Key components of Hadoop's ecosystem are explained, including HDFS for storage, MapReduce/Spark for processing, and Hive/Impala for querying. Examples are given of how companies like Walmart and UPS use big data analytics to improve business decisions. Career opportunities and typical salaries in big data are also mentioned.
This document discusses big data business opportunities and solutions. It notes that big data solutions are tailored to specific data types and workloads. Common business domains for big data include web analytics, clickstream analysis using the ELK stack, and big data in the cloud to provide auto-scaling, low costs, and use of cloud services. Effective big data solutions require data governance, cluster modeling, and analytics and visualization.
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts (Jane Roberts)
The document discusses modernizing enterprise data warehouses to handle big data by migrating workloads to a Hadoop-based data lake. It describes challenges with existing data warehouses and outlines Impetus's automated data warehouse workload migration tool which can help organizations migrate schemas, data, queries and access controls to Hadoop to realize the benefits of big data analytics while protecting existing investments.
Creating a Next-Generation Big Data Architecture (Perficient, Inc.)
If you’ve spent time investigating Big Data, you quickly realize that the issues surrounding Big Data are often complex to analyze and solve. The sheer volume, velocity and variety change the way we think about data – including how enterprises approach data architecture.
Significant reduction in costs for processing, managing, and storing data, combined with the need for business agility and analytics, requires CIOs and enterprise architects to rethink their enterprise data architecture and develop a next-generation approach to solve the complexities of Big Data.
Creating the data architecture while integrating Big Data into the heart of the enterprise data architecture is a challenge. This webinar covered:
-Why Big Data capabilities must be strategically integrated into an enterprise’s data architecture
-How a next-generation architecture can be conceptualized
-The key components to a robust next generation architecture
-How to incrementally transition to a next generation data architecture
The document provides an overview of Perficient, a leading information technology consulting firm, and their big data architectural series webinar on creating a next-generation big data architecture. The webinar discusses big data business use cases, the Hadoop ecosystem, realizing a Hadoop-centric architecture through different architectural roles for Hadoop including analytics, data warehousing, stream processing, data integration and transactional data stores. It also covers challenges in moving from potential to reality and provides recommendations for integrating Hadoop into the enterprise.
Big Data: Its Characteristics And Architecture Capabilities (Ashraf Uddin)
This document discusses big data, including its definition, characteristics, and architecture capabilities. It defines big data as large datasets that are challenging to store, search, share, visualize, and analyze due to their scale, diversity and complexity. The key characteristics of big data are described as volume, velocity and variety. The document then outlines the architecture capabilities needed for big data, including storage and management, database, processing, data integration and statistical analysis capabilities. Hadoop and MapReduce are presented as core technologies for storage, processing and analyzing large datasets in parallel across clusters of computers.
Big Data Tools: A Deep Dive into Essential Tools (FredReynolds2)
Today, practically every firm uses big data to gain a competitive advantage in the market. With this in mind, freely available big data tools for analysis and processing are a cost-effective and beneficial choice for enterprises. Hadoop is the sector’s leading open-source initiative and the driving force of the big data wave, and it is not the final chapter: numerous other projects pursue Hadoop’s free and open-source path.
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios (kcmallu)
What's the origin of Big Data? What are the real life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organizations?
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca... (Hortonworks)
The document discusses a Big Data Meetup organized by C-BAG (Chennai Big Data Analytic Group) on October 29, 2014 in Chennai. It provides details about two speakers, Dhruv Kumar from Concurrent Inc. and Vinay Shukla from Hortonworks, who will discuss reducing development time for production-grade Hadoop applications and Hortonworks' Hadoop platform respectively. The remainder of the document consists of presentation slides that cover topics including the modern data architecture with Hadoop, enterprise goals for data architecture, unlocking applications from new data types, and case studies.
Learn About Big Data and Hadoop: The Most Significant Resource (Assignment Help)
Data is now one of the most significant resources for businesses around the world because of the digital revolution. The ability to gather, organize, process, and evaluate huge volumes of data has altered the way businesses function and arrive at informed decisions. Managing and gleaning insight from the ever-expanding oceans of information is impossible without Big Data and Hadoop, both of which are at the vanguard of this data revolution.
If you have selected a programming language and have difficulty writing the best assignment, get the assistance of assessment-help experts to learn more about it. In this blog, we will look at the basics of Big Data and Hadoop and how they work. We will also explore the nature of Big Data: its defining features and the difficulties it presents. We'll also take a look at how Hadoop, an open-source platform, has become a frontrunner in the race to solve the challenges posed by Big Data. To fully appreciate the transformative potential of Big Data and Hadoop for businesses across a wide range of sectors, it is necessary first to grasp the central role they play in current data-driven decision-making.
Similar to Gartner Peer Forum Sept 2011 Orbitz
The document discusses the importance of customer discovery in product development. It advocates searching for and understanding customers at every step, as technology alone is not a business. The core belief expressed is that product discovery cannot be done in isolation from customer discovery. Some tips provided include practicing lean product development, having continuous integration and continuous delivery at the core, focusing on customers through hypothesis testing and being agile in goals and roadmaps.
Why don’t we hear as much about Agile in the BI space? This is probably attributable to the ad-hoc nature of BI projects and the fluid nature of end goals. Over the last few years we have tried to change this perception and leverage the effectiveness of being Agile in BI. BI is so vast that there are different areas where you can bring in the principles and processes of Agile. We are incorporating XP, Scrum and Kanban principles to transform the way we do development and support our businesses. We were also successful in incorporating CI within the BI area using tools such as Jenkins, Stash and Git.
Traditional BI uses waterfall methodology which is slow. Disruptive BI leverages agile principles like continuous integration, automation, and cloud to enable real-time analytics. It requires data scientists, programmers passionate about data, and big data technologies. The key things to take away are that agile is a mindset, continuous delivery is critical, and technology enables possibilities.
This presentation is for folks who want to understand the 101 of stock trading. Most of the information is available online in the mentioned reference websites.
Go over a few case studies which have had a great impact on travel experience for users and the technology behind them.
Usage of machine learning to provide personalized sorting of search results which in turn increased the propensity to buy.
Collecting and analyzing Site experimentation data.
Marketing Channel Optimization and Campaign effectiveness
Data platform that enables all of our use cases
Big Data Analytics from a Practitioner's View (Raghu Kashyap)
Raghu Kashyap is the Director of the Data Insights Group at Orbitz Worldwide. He has over 13 years of experience in technology and 4 years of experience in analytics and big data. At Orbitz, he is responsible for analytics, site insights, competitive intelligence, and supporting big data teams. Orbitz generates hundreds of gigabytes of log data daily from millions of air and hotel searches. They implemented Hadoop to cost-effectively store and analyze this large volume of data. Using Hadoop, Orbitz is able to perform machine learning, advanced analytics, marketing optimization, and other insights that were previously not possible. Kashyap emphasizes the importance of data governance and leadership buy-in.
Big Data redefines Enterprise Data Warehouse @Bangalore (Raghu Kashyap)
This document discusses how Big Data and Hadoop can redefine the traditional enterprise data warehouse approach. It describes Raghu Kashyap's role leading Big Data initiatives at Orbitz Worldwide and an approach using Hadoop and ETL to ingest raw logs and resolve database keys more efficiently than traditional ETL-only approaches. Examples of Hadoop applications at Orbitz include site analytics, machine learning, and multi-channel attribution. Lessons learnt include the importance of data governance and leadership buy-in for Big Data analytics.
Raghu Kashyap and Jami Timmons of Orbitz Worldwide discuss big data challenges including having the right reasons for using big data, centralized decentralization to break data into smaller chunks, and showing financial returns. They provide tips on unlocking data insights, gaining organizational buy-in to overcome resistance, establishing data governance, acknowledging short-term winners and losers, and that the journey with big data is long-term.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Things to Consider When Choosing a Website Developer for your Website | FODUUFODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
CAKE: Sharing Slices of Confidential Data on BlockchainClaudio Di Ciccio
Presented at the CAiSE 2024 Forum, Intelligent Information Systems, June 6th, Limassol, Cyprus.
Synopsis: Cooperative information systems typically involve various entities in a collaborative process within a distributed environment. Blockchain technology offers a mechanism for automating such processes, even when only partial trust exists among participants. The data stored on the blockchain is replicated across all nodes in the network, ensuring accessibility to all participants. While this aspect facilitates traceability, integrity, and persistence, it poses challenges for adopting public blockchains in enterprise settings due to confidentiality issues. In this paper, we present a software tool named Control Access via Key Encryption (CAKE), designed to ensure data confidentiality in scenarios involving public blockchains. After outlining its core components and functionalities, we showcase the application of CAKE in the context of a real-world cyber-security project within the logistics domain.
Paper: https://doi.org/10.1007/978-3-031-61000-4_16
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
1. Architecting for Big Data: Integrating Hadoop into an Enterprise Data Infrastructure. Raghu Kashyap and Jonathan Seidman, Gartner Peer Forum, September 14, 2011
3. Launched in 2001, Chicago, IL. Over 160 million bookings.
6. Why We Started Using Hadoop: Optimizing hotel search…
9. Hadoop Was Selected as a Solution… Transactional data (e.g. bookings) goes to the data warehouse; non-transactional data (e.g. searches) goes to Hadoop.
11. Current Big Data Infrastructure: Hadoop (HDFS and MapReduce) runs MapReduce jobs (Java, Python, R/RHIPE) and analytic tools (Hive, Pig); aggregated data flows to the data warehouse (Greenplum) via psql, gpload, and Sqoop, and to external analytical jobs (Java, R, etc.).
22. Click Data Processing – Current Data Warehouse Processing: web server logs are ETL'd from the web servers into the DW (~3 hours), then cleansed with a stored procedure (~2 hours), leaving roughly 20% of the original data size.
23. Click Data Processing – Proposed Hadoop Processing: web server logs go straight from the web servers into HDFS, data cleansing runs as MapReduce, and the cleansed output is loaded into the DW.
Welcome everyone. I will be presenting on how we are shaping up web analytics and big data to optimize data-driven decisions at Orbitz Worldwide. I will also be talking about the process model for how we effectively utilize the brains and manpower across the organization toward a common goal. Between me and Jonathan, we promise to give you some thought-provoking details about analytics and big data :-)
Most people think of orbitz.com, but Orbitz Worldwide is really a global portfolio of leading online travel consumer brands, including Orbitz, CheapTickets, The Away Network, ebookers, and HotelClub. Orbitz also provides business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to a number of leading carriers such as Amtrak, Delta, LAN, KLM, and Air France, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. Orbitz started in 1999; the orbitz.com site launched in 2001.
A couple of years ago when I mentioned Hadoop I’d often get blank stares, even from developers. I think most folks now are at least aware of what Hadoop is.
This chart isn't exactly an apples-to-apples comparison, but it gives some idea of the difference in cost per TB between the DW and Hadoop. Hadoop doesn't provide the same functionality as a data warehouse, but it does allow us to store and process data that wasn't practical before, for economic and technical reasons. Putting data into a DB or DWH requires having knowledge, or making assumptions, about how the data will be used; either way, you're putting constraints around how the data is accessed and processed. With Hadoop, each application can process the raw data in whatever way is required. If you decide you need to analyze different attributes, you just run a new query.
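To make that schema-on-read point concrete, here is a minimal sketch, not our production code, of a Hadoop Streaming mapper in Python that counts the values of one attribute directly from the raw logs; the tab-delimited layout and the field index are hypothetical placeholders.

```python
#!/usr/bin/env python
# attr_count_mapper.py -- counts the values of a single attribute straight
# from the raw logs; nothing about the schema is fixed up front.
#
# Example launch (streaming jar path varies by distribution):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/raw -output /logs/attr_counts \
#     -mapper attr_count_mapper.py -reducer aggregate \
#     -file attr_count_mapper.py
import sys

FIELD = 5  # hypothetical index of the attribute we want to analyze today

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > FIELD:
        # The "LongValueSum:" prefix tells Hadoop Streaming's built-in
        # aggregate reducer to sum the 1s per key.
        print("LongValueSum:%s\t1" % fields[FIELD])
```

Analyzing a different attribute means changing one index and rerunning the job, rather than altering a warehouse schema.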
The initial motivation was to solve a particular business problem. Orbitz wanted to be able to use intelligent algorithms to optimize various site functions, for example optimizing hotel search by showing consumers hotels that more closely match their preferences, leading to more bookings.
Improving hotel search requires access to such data as which hotels users saw in search results, which hotels they clicked on, and which hotels were actually booked. Much of this data was available in web analytics logs.
Management was supportive of anything that facilitated the machine learning team's efforts. But when we presented a hardware spec for servers with local non-RAIDed storage, etc., systems engineering offered us blades with attached storage.
Hadoop is used to crunch data for input to a system to recommend products to users. Although we use third-party sites to monitor site performance, Hadoop allows the front end team to provide detailed reports on page download performance, providing valuable trending data not available from other sources. Data is used for analysis of user segments, which can drive personalization. This chart shows that Safari users click on hotels with higher mean and median prices as opposed to other users. This is just a handful of examples of how Hadoop is driving business value.
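As a rough illustration of that kind of segment analysis, here is a small Python sketch that computes the mean and median clicked-hotel price per browser family; the two-column tab-separated input, presumably extracted from the click logs by an upstream job, is an assumption.

```python
#!/usr/bin/env python
# browser_price_stats.py -- mean and median clicked-hotel price per browser
# family, the kind of segment analysis described above. The input format
# (browser<TAB>price, one click per line) is a hypothetical placeholder.
import sys
from collections import defaultdict

prices = defaultdict(list)
for line in sys.stdin:
    try:
        browser, price = line.rstrip("\n").split("\t")
        prices[browser].append(float(price))
    except ValueError:
        continue  # skip malformed lines rather than abort the run

for browser, values in sorted(prices.items()):
    values.sort()
    mean = sum(values) / len(values)
    mid = len(values) // 2
    median = values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2.0
    print("%s\tmean=%.2f\tmedian=%.2f" % (browser, mean, median))
```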
Recently received an email from a user seeking access to Hive. Sent him a detailed email with info on accessing Hive, etc. Received an email back basically saying “you lost me at ssh”.
Previous to 2011 Hadoop responsibilities were split across technology teams. Moving under a single team centralized responsibility and resources for Hadoop.
Processing of click data gathered by web servers. This click data contains marketing info. The data cleansing step is done inside the data warehouse using a stored procedure, and further downstream processing is done to generate the final data sets for reporting. Although this processing generates the required user reports, it consumes considerable time and resources on the data warehouse, resources that could otherwise be used for reports, queries, etc.
The ETL step is eliminated; instead, raw logs are uploaded to HDFS, which is a much faster process. Moving the data cleansing to MapReduce moves the heavy lifting of processing these relatively large data sets to Hadoop and takes advantage of Hadoop's efficiencies, greatly speeding up the processing.
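A minimal sketch of what that cleansing step could look like as a map-only Hadoop Streaming job in Python; the column count, field positions, and validity rules are hypothetical placeholders, not the actual Orbitz rules.

```python
#!/usr/bin/env python
# clean_clicks.py -- data cleansing as a map-only Hadoop Streaming job,
# replacing the warehouse stored procedure. Field layout and validity
# rules here are hypothetical placeholders.
#
# Example launch (map-only, so no reducers):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/raw -output /data/clicks/cleaned \
#     -mapper clean_clicks.py -numReduceTasks 0 \
#     -file clean_clicks.py
import sys

REQUIRED_FIELDS = 12  # hypothetical column count of a well-formed record

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != REQUIRED_FIELDS:
        continue              # drop malformed records
    if not fields[0].isdigit():
        continue              # drop records lacking a numeric timestamp
    # Normalize the (hypothetical) marketing-channel field before loading.
    fields[3] = fields[3].strip().lower()
    print("\t".join(fields))
```

Run this way, each record is validated and normalized in parallel across the cluster instead of row by row inside the warehouse.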
The bad news is that we need to significantly increase the number of servers in our cluster; the good news is that this is because teams are using Hadoop and new projects are coming online.
I met someone at the train station who asked me what I do. I said I work in the web analytics field, helping shape the strategy and vision at Orbitz Worldwide and enabling our business teams to get insights on the performance of our site and act upon them. He said, "Ah, you do reporting" :-) That got me thinking about why web analytics is hard for people to get, and I started evangelizing both within and outside Orbitz. I manage the web analytics team at Orbitz Worldwide, and I also try to help out non-profit organizations when I am not busy with my wife and two sons.
So what is web analytics? Read the definitions on the slide. It tells you exactly why someone came to your site and what kind of impact they had on your revenue bottom line. You need to immerse yourself in data to understand the story it's telling. Focus on the customer: the customer is king, and you need to listen to and act upon their feedback. Test, test, and test: if you want to prove or disprove a HiPPO's opinion, you need to run tests on your site. (By the way, HiPPO is common industry terminology; it stands for Highest Paid Person's Opinion :-))
So with so many brands and so much data, we had quite a few challenges. For starters, we couldn't easily do multi-dimensional analysis with the tools. With data spread across multiple tools, it was hard to see the whole picture. Obviously, tools cost money. It was also harder for people to understand where to look for data. With analytics you need direction rather than precision to take action and get insights.
On the big data front, we didn't have a good infrastructure where we could house all this data in a cost-effective way. Data extraction was not an easy task. We had to focus on the key differences between when you need testing versus when you need reporting. Earlier I mentioned that you need to do rigorous outcome analysis; with all the challenges we faced, that was not an easy task.
So how do we fit the puzzle together? By learning the behavior of the customer and focusing on key attributes. Know the travel details: how many travelers, what kind of travelers, any preferred carriers or hotels? Understand the shopping patterns: does the customer shop only on weekends, or only on Thursdays? Focus on visit patterns: how many times does he come to the site before he buys anything? Learn the page navigation: does he view 100 pages every visit, or does he know exactly what to look at? Master the demand source: anyone who's worked on the marketing side knows that attribution is a holy war. Deciding which demand source gets the credit for a conversion is something people will argue about to the death, just like the IDE war between Vim, Emacs, IntelliJ, and Eclipse :-)
We realized that, with all the challenges we had, we had to innovate and experiment with new ways to enable successful web analytics at OWW. We generate hundreds of GB of log data per day: how can we effectively store this massive data, and how can we mine it and make sense of it? Our existing DW was not intended to support such large data sets, much less process them, and we also needed to make sure we didn't spend huge money storing them. The big data infrastructure with Hadoop has been a huge success at Orbitz and at other organizations. So what does this buy us? We can now store data for a long period of time without worrying too much about space. Analysts and developers have access to this data set, and developers can run ad hoc queries to support our business needs, while the core web analytics team focuses on company standards and metrics.
Here is an example of how we process our site analytics data today. We FTP the log files into our Hadoop infrastructure daily. The files are LZO-compressed for better storage utilization. Developers then write MapReduce jobs against these raw log files to output data into Hive tables (Hive is the DW equivalent in Hadoop). Most of the MapReduce jobs are written in Java and in scripting languages such as Python, Ruby, and Bash. Business teams, meanwhile, have the skill set to run queries against the Hive tables.
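For illustration, here is a hedged sketch of how a day's MapReduce output might be registered as a Hive partition so analysts can query it; the table name, HDFS path, and dt-partitioning scheme are hypothetical.

```python
#!/usr/bin/env python
# load_daily_clicks.py -- registers a day's MapReduce output as a Hive
# partition so analysts can query it. Table name, HDFS path, and the
# partitioning scheme are hypothetical placeholders.
import subprocess
import sys

day = sys.argv[1]  # e.g. "2011-09-14"
hql = """
ALTER TABLE click_data ADD IF NOT EXISTS PARTITION (dt='%s')
LOCATION '/data/clicks/cleaned/%s';
""" % (day, day)

# "hive -e" executes a HiveQL string non-interactively; this assumes the
# hive CLI is on the PATH and click_data is an external, dt-partitioned table.
subprocess.check_call(["hive", "-e", hql])
```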
Since the big data market is not that mature, there are no good ways yet to build visualization on top of Hive. For this and other reasons, we need to bring a subset of this data into our warehouse; in essence, data that lands in Hive makes its way into the warehouse. There are companies such as Karmasphere and Datameer that are in the early stages of bridging the gap between business needs and Hadoop access, but it's too early to say whether this will become the norm.
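A sketch, under assumptions, of moving an aggregated HDFS data set into the relational warehouse with Sqoop; the JDBC connection string, target table, and export directory are placeholders (the infrastructure slide also lists psql and gpload as alternatives for Greenplum).

```python
#!/usr/bin/env python
# export_to_dw.py -- pushes an aggregated HDFS data set into the relational
# warehouse with Sqoop. The JDBC URL, table, and directory are placeholders.
import subprocess

subprocess.check_call([
    "sqoop", "export",
    "--connect", "jdbc:postgresql://dw-host:5432/analytics",
    "--table", "agg_demand_source",             # target warehouse table
    "--export-dir", "/data/clicks/aggregated",  # HDFS directory to export
    "--input-fields-terminated-by", "\\t",      # records are tab-delimited
])
```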
We focused on some key areas of our business, such as demand source and campaigns, as our pilot, and worked with our business partners to enable analytics on big data. We have developers writing MapReduce jobs that run every day and populate Hive tables. We generate more than 25 million records per month for the pilot use case alone, which showcases the sheer magnitude and power of analytics within the big data framework.
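To give a flavor of that daily rollup, here is a hypothetical HiveQL aggregation run through the hive CLI from Python; the table and column names are invented for illustration.

```python
#!/usr/bin/env python
# campaign_rollup.py -- the flavor of daily aggregation run against the Hive
# tables for the demand-source/campaign pilot. Table and column names are
# invented for illustration.
import subprocess
import sys

day = sys.argv[1]  # e.g. "2011-09-14"
hql = """
SELECT demand_source, campaign, COUNT(*) AS clicks
FROM click_data
WHERE dt = '%s'
GROUP BY demand_source, campaign;
""" % day

# Prints the rollup to stdout; in practice the result set would be written to
# a reporting table and then exported to the warehouse (see the Sqoop sketch).
subprocess.check_call(["hive", "-e", hql])
```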
If you have read Avinash Kaushik's book or follow his blog, Occam's Razor, then you know he always mentions two phrases: "data puke" and "gold" (insights). Here we have a depiction of all kinds of insights provided in a nice dashboard format to our business users. These insights were only made possible by the data we housed in and extracted from Hadoop. Obviously I can't share what these graphs mean without giving away more details.
So how do you organizationally structure yourself and big data so that you can be effective, both in terms of resource utilization and in setting the platform up for success? This is what we call centralized decentralization. With this approach, the core web analytics team controls and supports the individual teams when it comes to data extraction and modeling. This prevents one team from becoming the bottleneck for data extraction and analytics. If you have ever worked on the data warehouse side of the world, you will know the challenges and delays in getting data.
With the core process of centralized decentralization and being agile, how do you succeed? You can't manage what you can't measure, but once you measure, make sure you fail fast. Every team needs to be thinking of analytics with every feature they work on. Dimensional modeling is great, but as someone wise said, "All models are wrong, but some are useful" :-) My point is that data without analysis is like a Ferrari without gas. If you make it a point to extract smaller chunks of data and tie that effort to your business objectives, you are sure to succeed.
Here are some key learnings from our experience and some thoughts for you to consider. If you have the technology strength, go for it, but know that this needs heavy investment in time and resources. And, as I have mentioned many times, data without analysis is worthless.
Thanks again for listening to our story; we are available for any further questions you may have. Also, if you know anyone who is interested in working at Orbitz, please check out the careers site.