Business Integration with
         CDH 4
       (including Apache Hadoop)

   Alexander Alten-Lorenz, Cloudera Inc.
       Munich, 22 February 2013
Challenges




Volume      Velocity   Variety
Business Integration
•   CRM               •   Invoicing

•   Analytics         •   Risk Management

•   Social Networks   •   Universal Data Access

•   Marketing         •   Data Governance

•   Document Store    •   SAP / Salesforce

•   Search Indices    •   Article and Storage Management
Use Cases
Risk Management

• Problem: Scoring of Customers and
  Projects
• Solution: Finance History, Communication
  and Pattern Detection
• Users: Finance, Insurance
Recommendations
• Problem: Recommend products that
  complement earlier purchases and match
  the customer's interests
• Solution: Statistical analysis of interests and
  purchase history, detection of matching
  swarm patterns
• Users: eCommerce, Advertising
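
A minimal sketch of the "statistical analysis of purchase history" idea, assuming a simple co-purchase count stands in for the pattern detection named on the slide. The basket data and the function names are hypothetical; a production job would run the same counting as a distributed job over the full purchase log.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each ordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate so a product bought twice in one basket counts once.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(product, baskets, top_n=3):
    """Return the products most often bought together with `product`."""
    counts = co_purchase_counts(baskets)
    related = Counter({b: n for (a, b), n in counts.items() if a == product})
    return [item for item, _ in related.most_common(top_n)]


# Hypothetical purchase history, one list per order.
baskets = [
    ["camera", "sd_card", "tripod"],
    ["camera", "sd_card"],
    ["camera", "bag"],
    ["laptop", "bag"],
]
print(recommend("camera", baskets))
```

Here `sd_card` ranks first for `camera` because the two appear together in two baskets, while every other product appears with it only once.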
Graph-Analytics
• Problem: Detect trends and curves in large
  distributed networks (Wired, Social, Mesh)
• Solution: Collect and mine all data, apply
  self-learning patterns to detect trends and
  produce forecasts
• Users: Enterprises, Gov, NGO, Provider,
  Telco, Stock Exchange
Detection of Dangerous Use
• Problem: Spam, Credit Card Abuse
• Solution: Pattern Detection, Prioritizing,
  Heuristic Analytics
• Users: Retail, Finance, Reseller
Text Analysis

• Problem: Detect the meaning of the written
  word (Sentiment Analysis)
• Solution: Keyword patterns, Coherence
  detection, Path detection
• Users: eCommerce, Social Media Service
  Provider, Opinion Research
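
A minimal sketch of the "keyword patterns" approach, assuming a tiny hand-made word list. The two word sets and the function name are hypothetical and only illustrate the counting; real sentiment analysis needs far larger lexicons and context handling.

```python
import re

# Hypothetical keyword lists; a real lexicon would be much larger.
POSITIVE = {"good", "great", "love", "fast"}
NEGATIVE = {"bad", "slow", "broken", "hate"}


def sentiment_score(text):
    """Return positive keyword hits minus negative keyword hits."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    return positive_hits - negative_hits


print(sentiment_score("Great camera, love it"))   # 2
print(sentiment_score("Slow and broken"))         # -2
```

A score above zero suggests a positive text and a score below zero a negative one; a score of zero means the keywords give no signal either way.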
Real-World Data Volumes

• eBay: 12 PB, Search Optimization
• Facebook: 50 PB, Logs, Reports
• Walmart: 4.5 PB, Customer Transactions
Sources:
http://wiki.apache.org/hadoop/PoweredBy
http://en.wikipedia.org/wiki/Big_data
Apache Hadoop
• Software framework for storing and processing
  large amounts of unstructured data
• Apache License
• Two core components
 • HDFS: Distributed data storage
 • MapReduce: Distributed data processing
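
To make the MapReduce half concrete, here is a single-process sketch of the map, shuffle and reduce phases using the classic word count. It runs on one machine and does not use the Hadoop API; the function names are illustrative only. On a cluster, the map and reduce calls would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "big cluster"]))
```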
Hadoop Cluster
[Diagram: a grid of identical Data Nodes]

Data Node: 4-16 Cores, 4-16 Disks,
8-64 GB RAM, 1-10 Gbit/s Network
Hadoop Distributed File System
[Diagram: a File is split into Blocks, which are spread across Data Nodes]
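
The idea of a file cut into blocks that are spread over Data Nodes can be sketched as follows. This is a simplification under stated assumptions: the block size is just a parameter, replicas are placed round-robin, and all names are hypothetical. Real HDFS placement is rack-aware and decided by the NameNode; the replication factor is not shown on the slide and is included here only to illustrate fault tolerance.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) for each block of a file; the last may be shorter."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A hypothetical 300 MB file with 128 MB blocks on four nodes.
blocks = split_into_blocks(300, 128)
print(blocks)
print(place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=2))
```

Because every block lives on more than one node, losing a single Data Node does not lose any part of the file.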
MapReduce
[Diagram: Data and Query, compared for RDBMS and Hadoop]
Features
                  HDFS   MapReduce
Distribution       ✔        ✔
Fault Tolerance    ✔        ✔
Scalability        ✔        ✔
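Hadoop Ecosystem
[Diagram: HDFS and MapReduce at the core, surrounded by Hive (SQL), Pig (Scripts), HBase, Oozie, ZooKeeper, Avro, Whirr, Mahout, Hue and the Java API; Sqoop connects to RDBMS, Flume collects Logs, further Connectors attach other sources]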
Example of an Integration
Scope
• Successful Audits per ISO 27001
• Analyze different Data Sources from
  different Databases and CRM Systems
• Realtime and Lifetime Statistics per Product
• Periodic Analytics and Statistics Jobs
• Weekly Re-Import into CRM
• Single Queries per User (Analyst) through a
  Secured GUI
Solution Path
• Cluster Authentication and Authorization via
  Kerberos, encrypted data communication / Data
  Protection
• Sqoop Connector to CRM / DB
   • Teradata, Oracle, PostgreSQL, MySQL, MS SQL
• Hive - HBase Integration
• Hive Analytics, controlled automatically by the Oozie
  Workload Orchestrator
• Hue Shell, Authentication via Kerberos SPNEGO
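
As an illustration of the Sqoop step, the sketch below only assembles the argument list for a `sqoop import` into Hive; it does not run anything. The flags (`--connect`, `--username`, `-P`, `--table`, `--hive-import`, `--hive-table`, `--num-mappers`) are standard Sqoop 1 import options, while the JDBC URL, table names and user are hypothetical placeholders, not values from this project.

```python
def sqoop_import_args(jdbc_url, table, username, hive_table, mappers=4):
    """Build the argument list for a Sqoop import of one table into Hive."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                        # prompt for the password on the console
        "--table", table,
        "--hive-import",             # load the imported data into Hive
        "--hive-table", hive_table,
        "--num-mappers", str(mappers),
    ]


# Hypothetical connection details, for illustration only.
args = sqoop_import_args(
    jdbc_url="jdbc:postgresql://crm-db.example.com/crm",
    table="customers",
    username="etl_user",
    hive_table="crm_customers",
)
print(" ".join(args))
```

A list like this could be handed to `subprocess.run` on a gateway host, or the same command could be placed in an Oozie workflow so that the import runs on the weekly schedule.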
[Diagram with columns CRM Park, Integration, CDH, Authentication: Sqoop (Integration); HBase (Real Time), Hive and Hue (End User) in CDH; Oozie (Automation); Kerberos (AD, MIT v5) for Authentication]
How to Manage?
Cloudera Manager
•   Automated Deployment   •   Reporting

•   Monitoring             •   Support Integration

•   Service Management

•   Log Management

•   Events and Alerts
Cloudera
• Founded 2009 in Palo Alto
• Cloudera's Distribution Including Hadoop
• CDH4 / Cloudera Manager 4
• > 320 employees worldwide
• Training, Consulting, Support, Development
• Enterprise Tools
Thank You!

• alexander@cloudera.com
• Twitter: @mapredit
• Blog: mapredit.blogspot.com
• http://www.cloudera.com/
• http://hadoop.apache.org/
