A Security Data Warehouse (SDW) is a massive database built using Hadoop and Hive that aggregates security and fraud-related event data from across an entire enterprise for long-term analytics. Zions Bank built an SDW to address limitations of their SIEM in dealing with large, unstructured datasets and to provide a common platform where security and fraud teams could collaborate by analyzing the complete historical data in one system. The SDW utilizes various Hadoop features for scalability, fault tolerance and handling different data types to support petabytes of stored data and thousands of daily analysis jobs.
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Cloudera, Inc.
This talk will cover what tools and techniques work and don’t work well for data scientists working on Hadoop today and how to leverage the lessons learned by the experts to increase your productivity as well as what to expect for the future of data science on Hadoop. We will leverage insights derived from the top data scientists working on big data systems at Cloudera as well as experiences from running big data systems at Facebook, Google, and Yahoo.
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Eric Baldeschwieler
A summary of the History of Hadoop, some observations about the current state of Hadoop for new users and some predictions about its future (Hint, it's gonna be huge).
Presented at:
http://www.meetup.com/Pasadena-Big-Data-Users-Group/events/203961192/
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Cloudera, Inc.
This talk will cover what tools and techniques work and don’t work well for data scientists working on Hadoop today and how to leverage the lessons learned by the experts to increase your productivity as well as what to expect for the future of data science on Hadoop. We will leverage insights derived from the top data scientists working on big data systems at Cloudera as well as experiences from running big data systems at Facebook, Google, and Yahoo.
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Eric Baldeschwieler
A summary of the History of Hadoop, some observations about the current state of Hadoop for new users and some predictions about its future (Hint, it's gonna be huge).
Presented at:
http://www.meetup.com/Pasadena-Big-Data-Users-Group/events/203961192/
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
Big Data projects are a struggle, not only on the technical side but also on the organizational side. In this talk the author shares his experience and opinions from almost 5 years of Big Data projects and develops an Agile Big Data Model which reflects his ideas on how Big Data projects can be successful, even in large companies.
Talk held at the crossover meetup of the "Agile Stammtisch Rhein-Main" and the "Hadoop & Spark User Group Rhein-Main" at codecentric AG on 31.01.2017.
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
Hadoop Administrator Online training course by (Knowledgebee Trainings) with mastering Hadoop Cluster: Planning & Deployment, Monitoring, Performance tuning, Security using Kerberos, HDFS High Availability using Quorum Journal Manager (QJM) and Oozie, Hcatalog/Hive Administration.
Contact : knowledgebee@beenovo.com
Hadoop World 2011: Mike Olson Keynote PresentationCloudera, Inc.
Now in its fifth year, Apache Hadoop has firmly established itself as the platform of choice for organizations that need to efficiently store, organize, analyze, and harvest valuable insight from the flood of data that they interact with. Since its inception as an early, promising technology that inspired curiosity, Hadoop has evolved into a widely embraced, proven solution used in production to solve a growing number of business problems that were previously impossible to address. In his opening keynote, Mike will reflect on the growth of the Hadoop platform due to the innovative work of a vibrant developer community and on the rapid adoption of the platform among large enterprises. He will highlight how enterprises have transformed themselves into data-driven organizations, highlighting compelling use cases across vertical markets. He will also discuss Cloudera’s plans to stay at the forefront of Hadoop innovation and its role as the trusted solution provider for Hadoop in the enterprise. He will share Cloudera’s view of the road ahead for Hadoop and Big Data and discuss the vital roles for the key constituents across the Hadoop community, ecosystem and enterprises.
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
Александр Козлов, Cloudera Inc.
Александр Козлов, старший архитектор в Cloudera Inc., работает с большими компаниями, многие из которых находятся в рейтинге Fortune 500, над проектами по созданию систем анализа большого количества данных. Закончил аспирантуру физического факультета Московского государственного университета, после чего также получил степень Ph.D. в Стэнфорде. До Cloudera и после окончания учебы работал над статистическим анализом данных и соответствующими компьютерными технологиями в SGI, Hewlett-Packard, а также стартапе Turn.
Тема доклада
Контроль зверей: инструменты для управления и мониторинга распределенных систем от Cloudera.
Тезисы
Поддержание распределенных систем, состоящих из тысяч компьютеров, является сложной задачей. Компания Cloudera, которая специализируется на создании распределенных технологий, разработала набор средств для централизованного управления распределенных Hadoop/HBase кластеров. Hadoop и HBase являются проектами Apache Software Foundation, и их применение для анализа частично структурированных данных ускоряется во всем мире. В этом докладе будет рассказано о SCM, системе для конфигурации, настройки, и управления Hadoop/HBase и Activity Monitor, системе для мониторинга ряда ОС и Hadoop/HBase метрик, а также об особенностях подхода Cloudera в отличие от существующих решений для мониторинга (Tivoli, xCat, Ganglia, Nagios и т.д.).
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Krishnan Parasuraman
Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this presentation, you will learn about the similarities and differences of Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s Analytics Team using Netezza.
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...lucenerevolution
Presented by M.C. Srivas | MapR. See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This session addresses the biggest issue facing Big Data – Search, Discovery and Analytics need to be integrated. While creating and maintaining separate SOLR and Hadoop clusters is time consuming, error prone and difficult to keep in synch, most Hadoop installations do not integrate with SOLR within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data Search including how to; protect against silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search including support for streaming data capture. Srivas will also share relevant experiences from his days at Google where he ran one of the major search infrastructure teams where GFS, BigTable and MapReduce were used extensively.
Stinger.Next by Alan Gates of HortonworksData Con LA
ver the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the Speed, Scale, and SQL compliance in Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever growing data in new ways, well known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop designed to deliver Speed, Scale and better SQL.
Development of concurrent services using In-Memory Data Gridsjlorenzocima
As part of OTN Tour 2014 believes this presentation which is intented for covers the basic explanation of a solution of IMDG, explains how it works and how it can be used within an architecture and shows some use cases. Enjoy
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
Big Data projects are a struggle, not only on the technical side but also on the organizational side. In this talk the author shares his experience and opinions from almost 5 years of Big Data projects and develops an Agile Big Data Model which reflects his ideas on how Big Data projects can be successful, even in large companies.
Talk held at the crossover meetup of the "Agile Stammtisch Rhein-Main" and the "Hadoop & Spark User Group Rhein-Main" at codecentric AG on 31.01.2017.
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
Hadoop Administrator Online training course by (Knowledgebee Trainings) with mastering Hadoop Cluster: Planning & Deployment, Monitoring, Performance tuning, Security using Kerberos, HDFS High Availability using Quorum Journal Manager (QJM) and Oozie, Hcatalog/Hive Administration.
Contact : knowledgebee@beenovo.com
Hadoop World 2011: Mike Olson Keynote PresentationCloudera, Inc.
Now in its fifth year, Apache Hadoop has firmly established itself as the platform of choice for organizations that need to efficiently store, organize, analyze, and harvest valuable insight from the flood of data that they interact with. Since its inception as an early, promising technology that inspired curiosity, Hadoop has evolved into a widely embraced, proven solution used in production to solve a growing number of business problems that were previously impossible to address. In his opening keynote, Mike will reflect on the growth of the Hadoop platform due to the innovative work of a vibrant developer community and on the rapid adoption of the platform among large enterprises. He will highlight how enterprises have transformed themselves into data-driven organizations, highlighting compelling use cases across vertical markets. He will also discuss Cloudera’s plans to stay at the forefront of Hadoop innovation and its role as the trusted solution provider for Hadoop in the enterprise. He will share Cloudera’s view of the road ahead for Hadoop and Big Data and discuss the vital roles for the key constituents across the Hadoop community, ecosystem and enterprises.
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
Александр Козлов, Cloudera Inc.
Александр Козлов, старший архитектор в Cloudera Inc., работает с большими компаниями, многие из которых находятся в рейтинге Fortune 500, над проектами по созданию систем анализа большого количества данных. Закончил аспирантуру физического факультета Московского государственного университета, после чего также получил степень Ph.D. в Стэнфорде. До Cloudera и после окончания учебы работал над статистическим анализом данных и соответствующими компьютерными технологиями в SGI, Hewlett-Packard, а также стартапе Turn.
Тема доклада
Контроль зверей: инструменты для управления и мониторинга распределенных систем от Cloudera.
Тезисы
Поддержание распределенных систем, состоящих из тысяч компьютеров, является сложной задачей. Компания Cloudera, которая специализируется на создании распределенных технологий, разработала набор средств для централизованного управления распределенных Hadoop/HBase кластеров. Hadoop и HBase являются проектами Apache Software Foundation, и их применение для анализа частично структурированных данных ускоряется во всем мире. В этом докладе будет рассказано о SCM, системе для конфигурации, настройки, и управления Hadoop/HBase и Activity Monitor, системе для мониторинга ряда ОС и Hadoop/HBase метрик, а также об особенностях подхода Cloudera в отличие от существующих решений для мониторинга (Tivoli, xCat, Ganglia, Nagios и т.д.).
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Krishnan Parasuraman
Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this presentation, you will learn about the similarities and differences of Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s Analytics Team using Netezza.
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...lucenerevolution
Presented by M.C. Srivas | MapR. See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This session addresses the biggest issue facing Big Data – Search, Discovery and Analytics need to be integrated. While creating and maintaining separate SOLR and Hadoop clusters is time consuming, error prone and difficult to keep in synch, most Hadoop installations do not integrate with SOLR within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data Search including how to; protect against silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search including support for streaming data capture. Srivas will also share relevant experiences from his days at Google where he ran one of the major search infrastructure teams where GFS, BigTable and MapReduce were used extensively.
Stinger.Next by Alan Gates of HortonworksData Con LA
ver the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the Speed, Scale, and SQL compliance in Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever growing data in new ways, well known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop designed to deliver Speed, Scale and better SQL.
Development of concurrent services using In-Memory Data Gridsjlorenzocima
As part of OTN Tour 2014 believes this presentation which is intented for covers the basic explanation of a solution of IMDG, explains how it works and how it can be used within an architecture and shows some use cases. Enjoy
3 Things to Learn:
-How data is driving digital transformation to help businesses innovate rapidly
-How Choice Hotels (one of largest hoteliers) is using Cloudera Enterprise to gain meaningful insights that drive their business
-How Choice Hotels has transformed business through innovative use of Apache Hadoop, Cloudera Enterprise, and deployment in the cloud — from developing customer experiences to meeting IT compliance requirements
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that makes Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Slides of a talk given to the Seattle Chapter of the Cloud Security Alliance. Looks briefly at Architectures, Sources of Log Data, and behavioral signatures in the data and issues and observations around using Big Data products for security.
Mtc learnings from isv & enterprise interactionGovind Kanshi
This is one of the dated presentation for which I keep getting requests for, please do reach out to me for status on various things as Azure keeps fixing/innovating whole of things every day.
There are bunch of other things I can help you on to ensure you can take advantage of Azure platform for oss, .net frameworks and databases.
Mtc learnings from isv & enterprise (dated - Dec -2014)Govind Kanshi
This is little dated deck for our learnings - I keep getting multiple requests for it. I have removed one slide for access permissions (RBAC -which are now available).
What exactly is big data? The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three Vs. Put simply, big data is larger, more complex data sets, especially from new data sources.
What is Big Data and why it is required and needed for the organization those who really need and generating huge amount of data and when it will be use
A talk through the journey we've been through at Snowplow thinking about event data, starting with our focus on web and then mobile analytics, and exploring our current and future technical and analytic approaches
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarioskcmallu
What's the origin of Big Data? What are the real life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organizations?
Securing and governing a multi-tenant data lake within the financial industryDataWorks Summit
Standard Bank South Africa is a Hortonworks client, with several multi-node clusters hosting Hortonworks Data Platform (HDP) and Hortonworks Data Flow (HDF). This presentation will discuss the technical detail of implementing security, governance and multi-tenancy on a "Data Lake" within the finance industry. The talk will address the team's experiences, challenges, failures and learnings that we took away from this behemoth of an adventure.
After introducing Standard Bank and the Hadoop admin team, the presentation will describe the security and governance journey Standard Bank has undergone since the project's inception in 2015, as well as the roadmap for the future ahead.
Presentation structure:
1. Team introduction with background information
2. Environment overview (Where we are - Current)
-----Security
---------Authentication through Kerberos and LDAP/ AD
---------Authorization through Ranger and Centrify
---------Transparent Data Encryption (TDE) at rest
-----Governance
---------Centralized auditing
---------Ranger policies and data steward ownership
-----Multi-Tenancy
---------Data lake Vs. data analytics platform
---------Edge nodes Vs. API framework through Knox
3. How did we get to this stage? (Past)
-----Challenges faced (Kerberos, AD integration, SSL)
-----How we overcame these challenges
4. Future challenges we foresee (Future)
-----How we are planning to prepare for them
Speakers
Ian Pillay, Hadoop Administrator, Standard Bank
Brad Smith, Hadoop Administrator, Standard Bank
Growing data volume, microservices, number of platforms
and data complexity makes traditional data validation
solutions costly to scale and difficult to manage.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used are the specific durability requirements of HBase's write-ahead log (WAL) and HDFS providing that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be so trivial to design applications that make most of its use, neither the most simple to operate. As it depends/integrates with other components from Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc) or external systems ( Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables are to be considered when observing anomalies or even outages. Adding to the equation there's also the fact that HBase is still an evolving product, with different release versions being used currently, some of those can carry genuine software bugs. On this presentation, we'll go through the most common HBase issues faced by different organisations, describing identified cause and resolution action over my last 5 years supporting HBase to our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables such as trips that are updatable. These tables are fundamental part of Uber's data-driven solutions, and act as the source-of-truth for all the analytical use-cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information of the data layout, and annotates each incoming change with the location in HDFS where this data should be written. This component is called as Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated. This leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. This component will need to have strong consistency and provide large throughputs for index writes and reads.
At Uber, we have chosen HBase to be the backing store for the Global Indexing component and is a critical component in allowing us to scaling our jobs where we are now handling greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound more on why we built the global index using Apache Hbase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load Hfiles directly to the backend circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in house projects to support Data Analytics on hundreds of petabytes of data. Our platform support storage, compute, data ingestion, discovery and management and various tools and libraries to help users for both batch and realtime analytics. Our DataPlatform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use cloud as another datacenter. We walk through our evaluation process, challenges we faced supporting data analytics at Twitter scale on cloud and present our current solution. Extending Twitter's Data platform to cloud was complex task which we deep dive in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
1. Security
Data
Deluge-‐
Zions
Bank's
Hadoop
Based
Security
Data
Warehouse
Claiming
the
Intersec>on
@
Informa>on
Security
and
Fraud
Brian
Chris>an,
CTO
and
Co-‐Founder,
ZeEaset
Michael
Fowkes,
SVP
and
Director
of
Fraud
Preven>on
and
Security
Analy>cs
,
Zions
Bancorp
2. Security
Data
Warehouse
• A
security
data
warehouse
is
a
massive
database
intended
to
aggregate
event
data
across
your
en>re
enterprise;
for
long
term
large-‐scale
security/fraud
related
analy>cs
• The
u>lity
of
this
system
is
realized
once
the
data
is
normalized
into
a
common
format,
and
mined
by
experts
with
in>mate
understanding
of
the
data
itself
• It’s
also
affordable
to
the
common
company
3. Why
SDW
Today
• “More
data
is
generated
in
3
days
than
in
the
history
of
the
world
to
2003”
–
Eric
Schmidt
• Fraudsters
con>nue
to
innovate
and
leverage
explosive
growth
of
portable
compu>ng
• Fraudsters
con-nue
to
study
“us”
• Through
massive
data
sets
and
comprehensive
analy>c
modeling,
you/we
can
begin
to
study
them
4. SDW
isn’t
a
product
• Security
is
never
a
product
it’s
a
process
• There
are
past
“processes”
that
help
build
the
system
– Key
example
is
SIEM:
SIEM
creates
a
“Big
Data”
problem
for
InfoSec.
Instead
of
dumping
that
data
a]er
60
days,
store
ALL
the
data
in
the
SDW
–
even
the
events
you’re
currently
not
logging
• When
fraud
teams
work
with
security,
the
common
pla`orm
will
accelerate
the
program
5. SDW
Data
Collec>on
• The
SDW
is
intended
to
collect
EVERYTHING
– Everything
in
terms
of
event
data/not
just
security
• SDW
business
analysts
live
by
the
expression
“the
more
data
I
receive,
the
beEer
I
feel”
• All
data
is
created
equal
–
but
data
mined
in
certain
combina>ons
is
more
interes>ng
than
others
–
Trust
but
verify:
This
goes
for
both
automated
controls
as
well
as
human
behavior
6. SDW
System
Availability
• The
system
should
be
easy
to
use
– Average
skilled
labor
to
maintain
the
pla`orm/
cluster
• The
system
must
fault
tolerant
– At
700TB
–
2PB
of
data,
when
a
hard
drives
fail
the
system
should
maintain
its
process
• The
SDW
should
grow
as
needed
without
performance
degrada>on
– Affordable
to
meet
tomorrow’s
demand
7. SDW
is
used
for
Mining
• SDW
is
where
Informa>on
Security
and
Fraud
teams
meet
to
solve
problems
– Most
InfoSec
and
Fraud
don’t
communicate
– Silos
of
data
are
collapsed
into
a
single
view
• The
SDW
is
a
laboratory
–
Not
SIEM
– Are
there
indicators
that
other
users/accounts
are
suscep>ble
to
fraud/aEack
– Run
the
model
through
the
en>re
database
to
account
for
similar
aEributes
8. What
is
a
Security
Data
Warehouse
• A
Security
Data
Warehouse
is
a
massive
mineable
database
• The
system
is
horizontally
scalable
to
Petabytes
of
data
• The
amount
of
data
available
for
analysis
is
historical
and
many
years
old
• Its
affordable
to
the
common
person!
– Risk
Management
are
the
common
people
in
IT
16. SIEM
Issues
• Rigid
data
models
• Did
not
deal
well
with
unstructured
data
• RDMS
performance
with
large
data
sets
• Limited
ways
to
interact
with
data
19. Why
Hadoop
&
Hive
• Scalability
/
performance
• Manage
resources
– Fair
Scheduler
• Fault
tolerance
• SQL
like
language
(HiveQL)
– Most
of
the
staff
had
SQL
skills
• Easy
applica>on
/
tool
integra>on
– ODBC
/
JDBC
driver
• Fast
data
inges>on
• Can
handle
unstructured
data
• Flexibility
&
extensibility
– UDF’s
– Streaming
jobs
20.
21.
22. ETL
Philosophy
• “Pre-‐mine”
Intelligence
during
the
ETL
process
– Add
value
at
>me
of
capture
(enrichment)
– Quickly
analyze
important
data
– Automate
>me-‐sensi>ve
ac>vi>es
• Load
all
data…no
filtering
of
data
that
will
be
loaded
into
the
warehouse
– You
don’t
know
what
you
will
want
tomorrow
– Leverage
file
compression,
rcfiles,
and
table
par>>oning
to
address
storage
/
performance
issues
• Store
2
years
worth
of
historical
data
23.
24. The
Team
• Data
Scien>st
• Data
Analyst
• LOB
User
• Data
Engineer
• Data
Pla`orm
Administrator
28. Data
Examples
• Web
server
logs
• Customer
database(s)
• OS
logs
• Fraud
model
alerts
• DB
logs
• Mainframe
ac>vity
logs
• Proxy
server
logs
• HTTP
(customer
Internet
ac>vity)
• SPAM
filter
logs
logs
• A/V
events
• ATM/POS
transac>ons
• DLP
events
• Credit
card
transac>ons
• VPN
logs
• G/L
logs
• DNS
logs
• ACH,
Wire,
and
Deposit
• Firewall
logs
transac>ons
• On-‐line
banking
applica>on
logs
• E-‐mail
logs
• Deposits
/
savings
/
>me
account
• Router
/
switch
logs
daily
balances
• IP
blacklists
• Vulnerability
scan
results
Over
120
data
sets
29. How
users
interact
with
the
data
• Data
Scien>st
– KNIME
– R
– Tableau
• Data
Analyst
– SQuirreL
SQL
– Hive
command
line
• LOB
User
– Datameer
– Custom
web
app
for
common
queries
• Parameterized
queries
• Output
to
HTML
table
or
tab
delimited
file
30. Sample
firewall
log
query
SELECT
collect_set(src_ip)
as
src_ips,
dst_ip,
protocol,
ac>on,
rule_uid,
collect_set(rule)
as
rules,
count
(*)
as
log_entry_count
FROM
firewall_logs
WHERE
day
=
‘2012-‐05-‐26’
AND
dst
=
‘1.1.1.1’
GROUP
BY
ac>on,
dst,
proto,
service,
rule_uid
ORDER
BY
dst_ip,
protocol,
rule_uid
31. Dealing
with
unstructured
data
Via
Perl:
while
(<INFILE>)
{
if
(
$_
=~
/s+w+s+Transac>onsInquirys+/)
{
chomp
$_;
my
($ts,
$ip,
$port,
$payload)
=
split(/|/,
$_);
if
($payload
=~
/
s+Account:s+d+s+Appl:s+(w+)s+/)
{
$product_code
=
$1;
}
if
($payload
=~
/
s+Account:s+(d+)s+Appl:s+w+s+/)
{
$account
=
$1;
}
if
($payload
=~
/
s+w+s+Transac>ons+Inquirys+(w+)s+/)
{
$bank
=
$1;
}
if
($payload
=~
/
s+(w+)s+Transac>ons+Inquirys+/)
{
$agent
=
$1;
}
print
OUTFILE
“$ts|$ip|$port|$product_code|$account|$bank|$agentn”;
}
}
32. Dealing
with
unstructured
data
Via
Hive:
SELECT
ts,
ip
,
port
,
regexp_extract(payload,
's+Account:s+d+s+Appl:s+(w+)s+',
1)
as
product_code,
regexp_extract(payload,
's+Account:s+(d+)s+Appl:s+w+s+',
1)
as
account,
regexp_extract(payload,
's+w+s+Transac>ons+Inquirys+(w+)s+',
1)
as
bank,
regexp_extract(payload,
's+(w+)s+Transac>ons+Inquirys+',
1)
as
agent
FROM
mainframe_logs
WHERE
day
=
'2012-‐05-‐26'
AND
payload
rlike
's+w+s+Transac>onsInquirys+’
34. Visualiza>on
example
Fraud
Events
Per
Day
JAN
FEB
MAR
APR
MAY
JUN
JUL
AUG
SEP
OCT
NOV
DEC
Sun
Mon
Tue
2010
Wed
Thu
Fri
Sat
Sun
Mon
Tue
2011
Wed
Thu
Fri
Sat
Sun
Mon
Tue
2012
Wed
Thu
Fri
Sat