If you are a system or application developer interested in learning how to solve practical problems using the Hadoop framework, then this book is ideal for you. You are expected to be familiar with the Unix/Linux command-line interface and have some experience with the Java programming language. Familiarity with Hadoop would be a plus.
Learning Hadoop 2

Table of Contents

Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
  Support files, eBooks, discount offers, and more
  Why subscribe?
  Free access for Packt account holders
Preface
  What this book covers
  What you need for this book
  Who this book is for
  Conventions
  Reader feedback
  Customer support
  Downloading the example code
  Errata
  Piracy
  Questions
1. Introduction
  A note on versioning
  The background of Hadoop
  Components of Hadoop
  Common building blocks
  Storage
  Computation
  Better together
  Hadoop 2 - what's the big deal?
  Storage in Hadoop 2
  Computation in Hadoop 2
  Distributions of Apache Hadoop
  A dual approach
  AWS - infrastructure on demand from Amazon
  Simple Storage Service (S3)
  Elastic MapReduce (EMR)
  Getting started
  Cloudera QuickStart VM
  Amazon EMR
  Creating an AWS account
  Signing up for the necessary services
  Using Elastic MapReduce
  Getting Hadoop up and running
  How to use EMR
  AWS credentials
  The AWS command-line interface
  Running the examples
  Data processing with Hadoop
  Why Twitter?
  Building our first dataset
  One service, multiple APIs
  Anatomy of a Tweet
  Twitter credentials
  Programmatic access with Python
  Summary
2. Storage
  The inner workings of HDFS
  Cluster startup
  NameNode startup
  DataNode startup
  Block replication
  Command-line access to the HDFS filesystem
  Exploring the HDFS filesystem
  Protecting the filesystem metadata
  Secondary NameNode not to the rescue
  Hadoop 2 NameNode HA
  Keeping the HA NameNodes in sync
  Client configuration
  How a failover works
  Apache ZooKeeper - a different type of filesystem
  Implementing a distributed lock with sequential ZNodes
  Implementing group membership and leader election using ephemeral ZNodes
  Java API
  Building blocks
  Further reading
  Automatic NameNode failover
  HDFS snapshots
  Hadoop filesystems
  Hadoop interfaces
  Java FileSystem API
  Libhdfs
  Thrift
  Managing and serializing data
  The Writable interface
  Introducing the wrapper classes
  Array wrapper classes
  The Comparable and WritableComparable interfaces
  Storing data
  Serialization and Containers
  Compression
  General-purpose file formats
  Column-oriented data formats
  RCFile
  ORC
  Parquet
  Avro
  Using the Java API
  Summary
3. Processing - MapReduce and Beyond
  MapReduce
  Java API to MapReduce
  The Mapper class
  The Reducer class
  The Driver class
  Combiner
  Partitioning
  The optional partition function
  Hadoop-provided mapper and reducer implementations
  Sharing reference data
  Writing MapReduce programs
  Getting started
  Running the examples
  Local cluster
  Elastic MapReduce
  WordCount, the Hello World of MapReduce
  Word co-occurrences
  Trending topics
  The Top N pattern
  Sentiment of hashtags
  Text cleanup using chain mapper
  Walking through a run of a MapReduce job
  Startup
  Splitting the input
  Task assignment
  Task startup
  Ongoing JobTracker monitoring
  Mapper input
  Mapper execution
  Mapper output and reducer input
  Reducer input
  Reducer execution
  Reducer output
  Shutdown
  Input/Output
  InputFormat and RecordReader
  Hadoop-provided InputFormat
  Hadoop-provided RecordReader
  OutputFormat and RecordWriter
  Hadoop-provided OutputFormat
  Sequence files
  YARN
  YARN architecture
  The components of YARN
  Anatomy of a YARN application
  Life cycle of a YARN application
  Fault tolerance and monitoring
  Thinking in layers
  Execution models
  YARN in the real world - Computation beyond MapReduce
  The problem with MapReduce
  Tez
  Hive-on-Tez
  Apache Spark
  Apache Samza
  YARN-independent frameworks
  YARN today and beyond
  Summary
4. Real-time Computation with Samza
  Stream processing with Samza
  How Samza works
  Samza high-level architecture
  Samza's best friend - Apache Kafka
  YARN integration
  An independent model
  Hello Samza!
  Building a tweet parsing job
  The configuration file
  Getting Twitter data into Kafka
  Running a Samza job
  Samza and HDFS
  Windowing functions
  Multijob workflows
  Tweet sentiment analysis
  Bootstrap streams
  Stateful tasks
  Summary
5. Iterative Computation with Spark
  Apache Spark
  Cluster computing with working sets
  Resilient Distributed Datasets (RDDs)
  Actions
  Deployment
  Spark on YARN
  Spark on EC2
  Getting started with Spark
  Writing and running standalone applications
  Scala API
  Java API
  WordCount in Java
  Python API
  The Spark ecosystem
  Spark Streaming
  GraphX
  MLlib
  Spark SQL
  Processing data with Apache Spark
  Building and running the examples
  Running the examples on YARN
  Finding popular topics
  Assigning a sentiment to topics
  Data processing on streams
  State management
  Data analysis with Spark SQL
  SQL on data streams
  Comparing Samza and Spark Streaming
  Summary
6. Data Analysis with Apache Pig
  An overview of Pig
  Getting started
  Running Pig
  Grunt - the Pig interactive shell
  Elastic MapReduce
  Fundamentals of Apache Pig
  Programming Pig
  Pig data types
  Pig functions
  Load/store
  Eval
  The tuple, bag, and map functions
  The math, string, and datetime functions
  Dynamic invokers
  Macros
  Working with data
  Filtering
  Aggregation
  Foreach
  Join
  Extending Pig (UDFs)
  Contributed UDFs
  Piggybank
  Elephant Bird
  Apache DataFu
  Analyzing the Twitter stream
  Prerequisites
  Dataset exploration
  Tweet metadata
  Data preparation
  Top n statistics
  Datetime manipulation
  Sessions
  Capturing user interactions
  Link analysis
  Influential users
  Summary
7. Hadoop and SQL
  Why SQL on Hadoop
  Other SQL-on-Hadoop solutions
  Prerequisites
  Overview of Hive
  The nature of Hive tables
  Hive architecture
  Data types
  DDL statements
  File formats and storage
  JSON
  Avro
  Columnar stores
  Queries
  Structuring Hive tables for given workloads
  Partitioning a table
  Overwriting and updating data
  Bucketing and sorting
  Sampling data
  Writing scripts
  Hive and Amazon Web Services
  Hive and S3
  Hive on Elastic MapReduce
  Extending HiveQL
  Programmatic interfaces
  JDBC
  Thrift
  Stinger initiative
  Impala
  The architecture of Impala
  Co-existing with Hive
  A different philosophy
  Drill, Tajo, and beyond
  Summary
8. Data Lifecycle Management
  What data lifecycle management is
  Importance of data lifecycle management
  Tools to help
  Building a tweet analysis capability
  Getting the tweet data
  Introducing Oozie
  A note on HDFS file permissions
  Making development a little easier
  Extracting data and ingesting into Hive
  A note on workflow directory structure
  Introducing HCatalog
  Using HCatalog
  The Oozie sharelib
  HCatalog and partitioned tables
  Producing derived data
  Performing multiple actions in parallel
  Calling a subworkflow
  Adding global settings
  Challenges of external data
  Data validation
  Validation actions
  Handling format changes
  Handling schema evolution with Avro
  Final thoughts on using Avro schema evolution
  Only make additive changes
  Manage schema versions explicitly
  Think about schema distribution
  Collecting additional data
  Scheduling workflows
  Other Oozie triggers
  Pulling it all together
  Other tools to help
  Summary
9. Making Development Easier
  Choosing a framework
  Hadoop streaming
  Streaming word count in Python
  Differences in jobs when using streaming
  Finding important words in text
  Calculate term frequency
  Calculate document frequency
  Putting it all together - TF-IDF
  Kite Data
  Data Core
  Data HCatalog
  Data Hive
  Data MapReduce
  Data Spark
  Data Crunch
  Apache Crunch
  Getting started
  Concepts
  Data serialization
  Data processing patterns
  Aggregation and sorting
  Joining data
  Pipelines implementation and execution
  SparkPipeline
  MemPipeline
  Crunch examples
  Word co-occurrence
  TF-IDF
  Kite Morphlines
  Concepts
  Morphline commands
  Summary
10. Running a Hadoop Cluster
  I'm a developer - I don't care about operations!
  Hadoop and DevOps practices
  Cloudera Manager
  To pay or not to pay
  Cluster management using Cloudera Manager
  Cloudera Manager and other management tools
  Monitoring with Cloudera Manager
  Finding configuration files
  Cloudera Manager API
  Cloudera Manager lock-in
  Ambari - the open source alternative
  Operations in the Hadoop 2 world
  Sharing resources
  Building a physical cluster
  Physical layout
  Rack awareness
  Service layout
  Upgrading a service
  Building a cluster on EMR
  Considerations about filesystems
  Getting data into EMR
  EC2 instances and tuning
  Cluster tuning
  JVM considerations
  The small files problem
  Map and reduce optimizations
  Security
  Evolution of the Hadoop security model
  Beyond basic authorization
  The future of Hadoop security
  Consequences of using a secured cluster
  Monitoring
  Hadoop - where failures don't matter
Description based on publisher-supplied metadata and other sources.