parsing - How to load an XML file with repeating tags and attributes in Spark 2.1?
I have an XML file with the following structure:
<?xml version="1.0"?>
<catalog>
  <new>
    <book id="bk101" language="en">
      <author id="4452" primary="true">gambardella, matthew</author>
      <title primary="true">xml developer's guide</title>
      <genre primary="false">computer</genre>
      <publisher primary="true" id="us124">
        <firm id="4124">amazon llc</firm>
        <address>ny, usa</address>
        <email type="official">books@amazon.com</email>
        <contact_person id="3351">
          <name>rajesh k.</name>
          <email type="personal">rajesh@amazon.com</email>
        </contact_person>
      </publisher>
    </book>
    <book id="bk103" language="en">
      <author id="4452" primary="true">corets, eva</author>
      <title primary="true">maeve ascendant</title>
      <genre primary="false">fantasy</genre>
      <publisher primary="true" id="us136">
        <firm id="4524">oreally llc</firm>
        <address>ny, usa</address>
        <email type="official">books@oreally.com</email>
        <contact_person id="1573">
          <name>prajakta g.</name>
          <email type="personal">prajakta@oreally.com</email>
        </contact_person>
      </publisher>
    </book>
  </new>
  <removed>
    <book id="bk104" language="en">
      <author id="4452" primary="true">corets, eva</author>
      <title primary="true">oberon's legacy</title>
      <genre primary="false">fantasy</genre>
      <publisher primary="true" id="us137">
        <firm id="4524">oreally llc</firm>
        <address>ny, usa</address>
        <email type="official">books@oreally.com</email>
        <contact_person id="1573">
          <name>prajakta g.</name>
          <email type="personal">prajakta@oreally.com</email>
        </contact_person>
      </publisher>
    </book>
  </removed>
</catalog>
How do I load it into a Dataset? I tried to follow the example from Databricks and received this error: AnalysisException: Reference '_id' is ambiguous, could be: _id#1, _id#3.
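For reference, a minimal sketch of the kind of spark-xml read the Databricks example describes (the file path and the rowTag value are assumptions, not taken from my actual project):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("xml-load")
  .master("local[*]")
  .getOrCreate()

// Each repeated <book> element becomes one row; XML attributes such as
// id and language appear as columns with spark-xml's default "_" prefix
// (e.g. _id, _language).
val books = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("catalog.xml")

books.printSchema()
books.show(false)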
I then replaced the StructField '_id' in my StructType schema with '_id#1', '_id#2', and so on (see the schema sketch after the stack trace below), but received this error:
Exception in thread "main" java.lang.ExceptionInInitializerError
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
  at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1094)
  at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
  at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:62)
  at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:62)
  at com.databricks.spark.xml.XmlRelation.buildScan(XmlRelation.scala:54)
  at com.databricks.spark.xml.XmlRelation.buildScan(XmlRelation.scala:63)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$12.apply(DataSourceStrategy.scala:343)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$12.apply(DataSourceStrategy.scala:343)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:383)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:464)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:379)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:339)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2791)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
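Regarding the first error, a simplified sketch of the kind of explicit schema involved (not my exact schema; the field names assume spark-xml's default "_" attribute prefix and "_VALUE" value tag). Attributes at different nesting levels can all keep the plain name "_id" because each one lives in its own struct:

import org.apache.spark.sql.types._

// Attribute columns use the "_" prefix; text of an element that also has
// attributes lands in "_VALUE".
val publisherSchema = StructType(Seq(
  StructField("_id", StringType),
  StructField("_primary", BooleanType),
  StructField("firm", StructType(Seq(
    StructField("_id", LongType),
    StructField("_VALUE", StringType)))),
  StructField("address", StringType),
  StructField("email", StructType(Seq(
    StructField("_type", StringType),
    StructField("_VALUE", StringType))))))

val bookSchema = StructType(Seq(
  StructField("_id", StringType),
  StructField("_language", StringType),
  StructField("author", StructType(Seq(
    StructField("_id", LongType),
    StructField("_primary", BooleanType),
    StructField("_VALUE", StringType)))),
  StructField("title", StructType(Seq(
    StructField("_primary", BooleanType),
    StructField("_VALUE", StringType)))),
  StructField("publisher", publisherSchema)))

Such a schema could then be passed to the reader with .schema(bookSchema) before .load(...).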
Solution: I found what solved the second error.
Add the older Jackson versions to the pom.xml file: jackson-core 2.6.7 and jackson-databind 2.6.7.
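Roughly, the dependency entries in pom.xml look like this (the group ID is assumed to be the standard com.fasterxml.jackson.core coordinates):

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-core</artifactId>
  <version>2.6.7</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.6.7</version>
</dependency>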