Data Lake for Enterprises

Sqoop connectors

A Sqoop connector allows a Sqoop job to:

  • Connect to the desired database system (import and export)
  • Extract data from the database system (import)
  • Load data into the database system (export)

Apache Sqoop can be extended with plugin code that specializes in transferring data to and from a particular database system. This capability is part of Sqoop's extension framework, and such plugins can be added to any Sqoop installation. Sqoop 1 has this capability; Sqoop 2 takes it further and adds many new features (covered in the earlier comparison section), including better integration through well-defined connector APIs.

When Sqoop is invoked to transfer data, two components come into play:

  • Driver: JDBC is one of the main mechanisms Sqoop uses to connect to an RDBMS, and in the context of Sqoop the driver means the JDBC driver. JDBC is a specification that ships with the Java Development Kit (JDK) and consists of various abstract classes and interfaces. Database vendors provide drivers that comply with the JDBC specification so that clients can connect to their systems. These drivers are often proprietary and carry licenses that govern how they may be used. For Sqoop to work, individual users have to install the appropriate driver themselves. Because the drivers are written by the database system providers, they are generally written with performance and efficiency in mind.
  • Connector: To run, a Sqoop job needs metadata about the data to be transferred. The connector retrieves this metadata and helps transfer the data (import and export) as efficiently as possible. JDBC uses SQL for data extraction and load, but each database system has its own variations on SQL, known as dialects. A connector uses the dialect of its database system to transfer data efficiently. Sqoop ships with a default (generic) JDBC connector, which works with any JDBC- and SQL-compliant database system; because it is generic, however, it may not be the most efficient way of transferring data. There are also other built-in connectors and external specialized connectors, which are discussed in detail in the following subsection.
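To make the idea of dialects concrete, the following is a minimal Python sketch of one well-known difference between database systems: how table and column names are quoted. It is an illustration only and is not Sqoop's code. The names `IDENTIFIER_QUOTES`, `quote_identifier`, and `metadata_query` are hypothetical, and the `WHERE 1 = 0` query is simply a common way to obtain column metadata without fetching any rows.

```python
# Hypothetical illustration of SQL dialects; this is not Sqoop's implementation.
# Each entry maps a dialect name to its opening and closing identifier quotes.
IDENTIFIER_QUOTES = {
    "generic": ('"', '"'),    # ANSI SQL style double quotes
    "mysql": ("`", "`"),      # MySQL backticks
    "sqlserver": ("[", "]"),  # SQL Server square brackets
}


def quote_identifier(name, dialect="generic"):
    """Quote a table or column name for the given dialect."""
    # Unknown dialects fall back to the generic (ANSI) quoting rule.
    open_q, close_q = IDENTIFIER_QUOTES.get(dialect, IDENTIFIER_QUOTES["generic"])
    return f"{open_q}{name}{close_q}"


def metadata_query(table, dialect="generic"):
    """Build a query that returns no rows but still exposes column metadata."""
    return f"SELECT * FROM {quote_identifier(table, dialect)} WHERE 1 = 0"


if __name__ == "__main__":
    for dialect in ("generic", "mysql", "sqlserver"):
        print(dialect, "->", metadata_query("orders", dialect))
```

A generic connector can only rely on the common denominator, whereas a specialized connector can use the syntax and features specific to its database system.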

The following figure (Figure 08) shows how the Sqoop client uses these components to obtain a connection and then uses that connection object to transfer data to and from the database system:

Figure 08: Sqoop connector components and how they work

In the case of Sqoop 1, when a command is executed, Sqoop first analyzes the command-line arguments and scans the Sqoop installation for the most suitable (most efficient and best-performing) connector. It scans both the built-in and the manually installed connectors while choosing the best option. If it cannot find a suitable connector, it falls back to the built-in generic JDBC connector as a last resort. Once it has selected a connector, it looks for the best driver; in most cases a specialized driver is tied to the connector and database system. With the generic JDBC connector, however, the driver has to be supplied explicitly through a command-line parameter (the --driver option).
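The selection steps described above can be sketched in Python as follows. This is a simplified model of the behavior described in the text, not Sqoop's actual implementation: the `Connector` class, the `INSTALLED_CONNECTORS` list, the `choose_connector` function, and the idea of matching on the JDBC URL prefix are all assumptions made for illustration. The driver class names are examples of commonly used JDBC driver classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Connector:
    """Hypothetical description of a connector (illustration only)."""
    name: str
    url_prefix: str                # JDBC URL prefix this connector handles
    default_driver: Optional[str]  # driver class tied to the connector, if any


# The generic connector has no driver of its own.
GENERIC_JDBC = Connector("generic-jdbc", "jdbc:", None)

# Assumed set of built-in and manually installed specialized connectors.
INSTALLED_CONNECTORS = [
    Connector("mysql", "jdbc:mysql:", "com.mysql.jdbc.Driver"),
    Connector("postgresql", "jdbc:postgresql:", "org.postgresql.Driver"),
]


def choose_connector(connect_url, driver=None, installed=INSTALLED_CONNECTORS):
    """Return (connector, driver_class) for a JDBC connection URL.

    A specialized connector is preferred when one matches the URL;
    otherwise the generic JDBC connector is used as a last resort,
    and the caller must then name the driver class explicitly.
    """
    for connector in installed:
        if connect_url.startswith(connector.url_prefix):
            return connector, connector.default_driver
    if driver is None:
        raise ValueError(
            "no specialized connector found; the generic JDBC connector "
            "needs an explicit driver class"
        )
    return GENERIC_JDBC, driver


if __name__ == "__main__":
    print(choose_connector("jdbc:mysql://dbhost/sales"))
    print(choose_connector("jdbc:db2://dbhost/sales",
                           driver="com.ibm.db2.jcc.DB2Driver"))
```

In this model, a MySQL URL resolves to the specialized connector and its associated driver, while a URL with no matching connector resolves to the generic connector only if a driver class is given.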

One difference in Sqoop 2 is that the connector has to be selected explicitly, whereas Sqoop 1 selects it implicitly.