Big Data and Lynda.com

Big Data Camp LA 2014: How Lynda.com is getting started with Big Data, by Subash DSouza of Lynda.com

  • Speaker note: The Tin Can API, now officially known as the Experience API (xAPI), is an e-learning software specification that lets learning content and learning systems talk to each other so that all types of learning experiences can be recorded, tracked, and stored in Learning Record Stores.
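For context, an xAPI "statement" is a small actor/verb/object JSON document POSTed to the LRS's /statements endpoint. A minimal sketch in Python; the LRS URL, credentials, and course URI below are hypothetical:

```python
import requests  # pip install requests

# A minimal xAPI statement: who (actor) did what (verb) to what (object).
statement = {
    "actor": {"mbox": "mailto:learner@example.com", "name": "Example Learner"},
    "verb": {
        "id": "http://adlnet.gov/expapi/verbs/completed",
        "display": {"en-US": "completed"},
    },
    "object": {
        "id": "https://www.lynda.com/courses/12345",  # hypothetical course URI
        "definition": {"name": {"en-US": "Example Course"}},
    },
}

# POST it to the LRS's /statements endpoint (URL and credentials are placeholders).
resp = requests.post(
    "https://lrs.example.com/xapi/statements",
    json=statement,
    headers={"X-Experience-API-Version": "1.0.0"},
    auth=("lrs_user", "lrs_password"),
)
resp.raise_for_status()
```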

1. Big Data and Lynda.com (Subash DSouza)
2. Who is Lynda.com?
   • lynda.com is an online learning company that helps anyone learn software, design, and business skills to achieve their personal and professional goals.
   • Founded in 1995 by Lynda Weinman and Bruce Heavin.
   • Went online in 2002.
   • As of January 2014, lynda.com offers more than 2,400 courses in business, design, web, programming, photography, video, 3D and animation, audio, education, and CAD.
3. Why Big Data?
   • With the growth of users on Lynda.com, data has increased rapidly.
   • With the amount of data we collect, there has been a drive to derive more insights from it.
   • We collect data from multiple sources such as Google Analytics, internal logs, and user sessions.
4. Current use cases of Big Data at Lynda.com
   • We use MongoDB as a Learning Record Store, to host user configuration for notifications, and as the data source for the localized text on the main web site.
   • A Learning Record Store (LRS) is a data store that serves as a repository for the learning records required by the Tin Can API.
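To illustrate the LRS use case, a minimal sketch with pymongo; the host, database, and field names are hypothetical, not Lynda.com's actual schema:

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://mongo.example.com:27017")  # hypothetical host
lrs = client["lrs"]["statements"]                          # hypothetical db/collection

# Store one learning record (an xAPI-style statement).
lrs.insert_one({
    "actor": "mailto:learner@example.com",
    "verb": "completed",
    "object": "https://www.lynda.com/courses/12345",
    "timestamp": "2014-06-21T10:30:00Z",
})

# Pull back everything a given learner has completed.
for record in lrs.find({"actor": "mailto:learner@example.com", "verb": "completed"}):
    print(record["object"])
```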
5. Current use cases of Big Data at Lynda.com
   • Recommendation algorithms using Myrrix: data is fed once a day to our recommendation servers, which run on Myrrix.
   • Myrrix was a "Big Learning" machine-learning platform built on top of Apache Hadoop and Apache Mahout.
   • It was acquired by Cloudera last August.
   • It has been succeeded by Oryx, which has tighter integration with CDH.
   • We are working on migrating to Oryx.
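Myrrix's serving layer exposed a simple REST API. A sketch of the daily feed-and-recommend cycle, assuming its /ingest and /recommend endpoints and a hypothetical host and ID scheme:

```python
import requests

MYRRIX = "http://myrrix.example.com:8080"  # hypothetical serving-layer host

# Daily feed: POST user,item,strength rows to the serving layer's /ingest endpoint.
interactions = "10001,course-12345,1.0\n10001,course-67890,0.5\n"
requests.post(MYRRIX + "/ingest", data=interactions).raise_for_status()

# Fetch recommendations for a user; responses are item,score CSV lines.
resp = requests.get(MYRRIX + "/recommend/10001")
resp.raise_for_status()
for line in resp.text.splitlines():
    item, score = line.split(",")
    print(item, float(score))
```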
6. The future of Big Data at Lynda.com
   • Use the data we collect to gain better insights for our business decision making.
   • Combine Google Analytics with our own internal logs and user sessions to understand our users better. This will allow us to create customized experiences for our users.
   • A better user experience will keep users on the site longer and will also improve our turnover rate.
7. How are we achieving that?
   • Building out Hadoop clusters on YARN.
   • Using HBase for some of our real-time use cases.
   • Testing out Spark and Storm.
   • Still in early stages.
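As a flavor of the Spark testing, a minimal PySpark job on YARN that counts course views from staged logs; the log format, paths, and app name are assumptions:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("course-view-counts").setMaster("yarn-client")
sc = SparkContext(conf=conf)

# Hypothetical log format: timestamp<TAB>user_id<TAB>course_id
views = (sc.textFile("hdfs:///data/staging/lynda_logs/")
           .map(lambda line: line.split("\t"))
           .filter(lambda parts: len(parts) == 3)   # drop malformed lines
           .map(lambda parts: (parts[2], 1))        # key by course_id
           .reduceByKey(lambda a, b: a + b)         # views per course
           .sortBy(lambda kv: kv[1], ascending=False))

for course, count in views.take(10):
    print(course, count)

sc.stop()
```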
8. Agenda
   • Introduction of Hadoop to lynda.com
   • Big Data overview
9. Hadoop Architecture Stack
   [Diagram: Hadoop stack and data access. Data from Google Analytics, lynda logs, and user sessions is extracted by Flume into an HDFS staging area, transformed with MapReduce, Hive/Pig, and HBase into consumable data, and propagated to RDBMSs/files via Sqoop and on to business intelligence tools, services, and APIs. Supporting components: Kerberos (security), HCatalog (metadata), Oozie (job scheduling), Avro (data serialization), Hue (direct access to raw data), Nagios/Ganglia/Ambari (monitoring), plus data governance.]
10. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting data collecting/acquisition.]
   Data collecting/acquisition: start by archiving user sessions; acquire data from Google Analytics and lynda logs.
11. Hadoop Architecture Stack
   [Same diagram as slide 9.]
12. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting staging.]
   Staging (data processing, ELT): put the data in one place so that it can be transformed efficiently by another process. This is the "Extract" and "Load" part of the ELT process.
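A sketch of the "Extract" and "Load" steps using the Python hdfs WebHDFS client; the NameNode address, user, and paths are hypothetical:

```python
from hdfs import InsecureClient  # pip install hdfs (a WebHDFS client)

# Hypothetical NameNode WebHDFS endpoint and service user.
client = InsecureClient("http://namenode.example.com:50070", user="etl")

# "Extract" is just copying the raw files off the source system;
# "Load" is landing them, untransformed, in a dated staging directory.
client.makedirs("/data/staging/lynda_logs/2014-06-21")
client.upload("/data/staging/lynda_logs/2014-06-21/access.log",
              "/var/log/lynda/access.log")  # hypothetical local path
```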
13. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting HDFS.]
   HDFS: with HDFS and the other components of the Hadoop stack, lynda.com will be able to acquire and store large amounts of data quickly and accurately.
14. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting consumable data.]
   Consumable data: data that has been transformed and can be consumed by systems outside of Hadoop. Given our lack of expertise in Java, we will probably lean on our ingestion tooling and use an ETL rather than an ELT strategy.
15. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting HBase.]
   HBase: this interface to Hadoop is tightly integrated with HDFS. Hive and Pig do not have this tight integration.
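A minimal real-time read/write sketch against HBase via the happybase Thrift client; the Thrift gateway, table, and row-key scheme are hypothetical:

```python
import happybase  # pip install happybase (Thrift client for HBase)

conn = happybase.Connection("hbase-thrift.example.com")  # hypothetical Thrift gateway
table = conn.table("user_sessions")                      # hypothetical table

# Write one cell: row key = user id + timestamp, column family "s".
table.put(b"10001:20140621T103000", {b"s:course_id": b"course-12345"})

# Low-latency point read, the kind of real-time access HBase is suited to.
row = table.row(b"10001:20140621T103000")
print(row[b"s:course_id"])
```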
16. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting Hive/Pig.]
   Hive/Pig: Hive and Pig are SQL and scripting interfaces into Hadoop. Both of these interfaces sit outside of Hadoop.
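A sketch of querying Hive from Python with PyHive, which avoids writing any Java; the HiveServer2 host and table are hypothetical:

```python
from pyhive import hive  # pip install pyhive

# Hypothetical HiveServer2 endpoint and table.
conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL is compiled down to MapReduce jobs under the hood.
cursor.execute("""
    SELECT course_id, COUNT(*) AS views
    FROM lynda_logs
    WHERE dt = '2014-06-21'
    GROUP BY course_id
    ORDER BY views DESC
    LIMIT 10
""")
for course_id, views in cursor.fetchall():
    print(course_id, views)
```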
17. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting RDBMS/flat files.]
   RDBMS/flat files: Hadoop data will be "pushed" and/or "pulled" into RDBMSs or flat files for consumption outside of the Hadoop stack.
18. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting services and APIs.]
   Services and APIs: APIs will be available for the consumption of data, making data available from both Hadoop and RDBMSs.
19. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting security.]
   Security: authentication and access to the HDFS data will be handled with Kerberos. Note: this security will not be comparable to an RDBMS's.
20. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting metadata.]
   HCatalog: HCatalog abstracts data locations and standardizes data types across Pig, Hive, and MapReduce. It is a metadata tool that is part of the Hadoop ecosystem.
21. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting MapReduce.]
   MapReduce: for manipulating data in HDFS, this is "lower-level" programming, and it will be a while before we venture into this area. It is all written in Java and requires a strong understanding of the Hadoop Distributed File System (HDFS).
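One way around the Java requirement is Hadoop Streaming, which runs any executable as the mapper and reducer. A minimal Python pair that counts views per course, assuming tab-separated log lines (timestamp, user_id, course_id):

```python
#!/usr/bin/env python
# mapper.py: read log lines on stdin, emit "course_id<TAB>1" per view.
import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")  # timestamp, user_id, course_id (assumed)
    if len(parts) == 3:
        print("%s\t1" % parts[2])
```

```python
#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so counts can be summed per course.
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))
```

These would be launched with the hadoop-streaming JAR, passing -mapper and -reducer along with -input and -output HDFS paths.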
22. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting job scheduling.]
   Oozie:
   • Scheduling: MapReduce jobs need scheduling, and the jobs must be staged somewhere for consumption (possibly in Hadoop itself). Options are Oozie, a workflow organizer, or Python/cron scripts.
   • Data output: decide what the output of scheduled jobs looks like (e.g., emailed reports), where the data will be put, and in what format, such as a SQL table or a file.
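A sketch of the Python/cron alternative to Oozie mentioned above: run a Hive query on a schedule and email the result as a report. The hosts, addresses, and table are hypothetical:

```python
#!/usr/bin/env python
# daily_report.py: run from cron as a lightweight alternative to Oozie.
# Assumes the `hive` CLI is on the PATH and an internal SMTP relay exists.
import datetime
import smtplib
import subprocess
from email.mime.text import MIMEText

today = datetime.date.today().isoformat()
query = ("SELECT course_id, COUNT(*) FROM lynda_logs "
         "WHERE dt = '%s' GROUP BY course_id" % today)

# Run the scheduled job; the query's stdout becomes the report body.
report = subprocess.check_output(["hive", "-e", query])

# Email the output, as the slide suggests for scheduled-job results.
msg = MIMEText(report.decode("utf-8"))
msg["Subject"] = "Daily course-view report for %s" % today
msg["From"] = "etl@example.com"
msg["To"] = "bi-team@example.com"
smtplib.SMTP("smtp.example.com").sendmail(msg["From"], [msg["To"]], msg.as_string())
```

It would be scheduled with an ordinary crontab entry, e.g. running every morning before the business day starts.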
23. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting extract to RDBMS.]
   Sqoop: Sqoop is an Apache project designed to "sqoop" data between Hadoop and relational databases; data is "sqooped up" and put into SQL Server or dumped into a file. Remember "the tyranny of OR and the inclusiveness of AND": we are not going to use SQL Server OR Hadoop, we will use SQL Server AND Hadoop. Facebook has to use both, and when it comes to this technology we are not better than Facebook.
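A sketch of a Sqoop export from HDFS into SQL Server, wrapped in Python so it can be scheduled like the other jobs; the JDBC URL, credentials path, table, and directories are hypothetical:

```python
import subprocess

# Export transformed results from HDFS into SQL Server.
subprocess.check_call([
    "sqoop", "export",
    "--connect", "jdbc:sqlserver://sql.example.com:1433;databaseName=analytics",
    "--username", "etl",
    "--password-file", "/user/etl/.sqlpass",   # password kept in HDFS, not argv
    "--table", "course_views",
    "--export-dir", "/data/consumable/course_views",
    "--input-fields-terminated-by", "\t",
])
```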
24. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting data extraction.]
   Flume: Flume is a part of the Hadoop ecosystem used to collect data and/or data files from multiple locations and load them into HDFS.
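A sketch of feeding events into a Flume agent, assuming the agent is configured with an HTTP source using the default JSON handler, which accepts a JSON array of events each carrying "headers" and "body"; the endpoint and event format are hypothetical:

```python
import json
import requests

# One event per page view; "headers" plus "body" is the structure the
# default JSON handler for Flume's HTTP source expects.
events = [
    {"headers": {"source": "webapp"},
     "body": "2014-06-21T10:30:00Z\t10001\tcourse-12345"},
]

resp = requests.post(
    "http://flume.example.com:44444",  # hypothetical agent endpoint
    data=json.dumps(events),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```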
25. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting monitoring tools.]
   Nagios, Ganglia, Ambari, Cloudera Manager: these can be used to monitor MapReduce operations, ensuring that jobs run on time and that alerts are sent when jobs run too long. These tools will also assist in performance monitoring and optimization.
26. Hadoop Architecture Stack
   [Same diagram as slide 9.]
   Services and API access to Hive/Pig.
27. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting Hue.]
   Hue aggregates the most common Hadoop components (e.g., a file browser for HDFS, a job browser for MapReduce/YARN, HBase, Hive, and Pig) into a single interface.
28. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting data serialization.]
   Avro: Avro uses JSON for defining data types and protocols, and serializes data in a compact binary format. It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
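A minimal sketch of Avro's JSON schemas and binary container files using the Python avro library; the record type is hypothetical:

```python
import avro.schema  # pip install avro
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Schemas are plain JSON; this record type is hypothetical.
schema = avro.schema.parse("""
{"type": "record", "name": "CourseView",
 "fields": [{"name": "user_id",   "type": "long"},
            {"name": "course_id", "type": "string"}]}
""")

# Serialize records into Avro's compact binary container format...
writer = DataFileWriter(open("views.avro", "wb"), DatumWriter(), schema)
writer.append({"user_id": 10001, "course_id": "course-12345"})
writer.close()

# ...and read them back; the schema travels inside the file itself.
reader = DataFileReader(open("views.avro", "rb"), DatumReader())
for record in reader:
    print(record)
reader.close()
```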
29. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting business intelligence.]
   Business intelligence: a BI strategy will need to be developed and enabled. This will be critical because one of the most-cited benefits of Hadoop is discovery, and we will need to enable discovery in this paradigm.
30. Hadoop Architecture Stack
   [Same diagram as slide 9, highlighting governance.]
   Governance: the fundamentals of data governance will need to be established. Core values like "master data" will need to be defined, and the Big Data platform will need to be beholden to and integrated with these data-governance values. Issues like the data life cycle and entitlements to PII data will be part of the Big Data implementation.
31. Hadoop Architecture Stack
   [Summary diagram mapping process steps to implementations: Ingest = Flume; Describe = HCatalog; Compute = MapReduce; Persist = HDFS/HBase; Monitor = Nagios; Propagate = Sqoop; Develop = Hive/Pig/Avro.]
32. Thank you!!
   • @sawjd22
   • sdsouza@lynda.com
   • www.linkedin.com/in/sawjd/
   • Q&A!!
